From 8feb93be396665b2cad92e0b944bbef6fbf90d53 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Mon, 27 Jan 2020 16:20:44 +0200 Subject: [PATCH] Add notes for 2020-01-27 --- content/posts/2020-01.md | 37 + docs/2015-11/index.html | 22 +- docs/2015-12/index.html | 26 +- docs/2016-01/index.html | 16 +- docs/2016-02/index.html | 56 +- docs/2016-03/index.html | 40 +- docs/2016-04/index.html | 60 +- docs/2016-05/index.html | 40 +- docs/2016-06/index.html | 32 +- docs/2016-07/index.html | 16 +- docs/2016-08/index.html | 44 +- docs/2016-09/index.html | 50 +- docs/2016-10/index.html | 32 +- docs/2016-11/index.html | 62 +- docs/2016-12/index.html | 98 +- docs/2017-01/index.html | 40 +- docs/2017-02/index.html | 40 +- docs/2017-03/index.html | 38 +- docs/2017-04/index.html | 72 +- docs/2017-05/index.html | 46 +- docs/2017-06/index.html | 40 +- docs/2017-07/index.html | 38 +- docs/2017-08/index.html | 76 +- docs/2017-09/index.html | 98 +- docs/2017-10/index.html | 74 +- docs/2017-11/index.html | 116 +- docs/2017-12/index.html | 110 +- docs/2018-01/index.html | 180 +- docs/2018-02/index.html | 144 +- docs/2018-03/index.html | 76 +- docs/2018-04/index.html | 74 +- docs/2018-05/index.html | 66 +- docs/2018-06/index.html | 70 +- docs/2018-07/index.html | 58 +- docs/2018-08/index.html | 76 +- docs/2018-09/index.html | 144 +- docs/2018-10/index.html | 92 +- docs/2018-11/index.html | 52 +- docs/2018-12/index.html | 62 +- docs/2019-01/index.html | 94 +- docs/2019-02/index.html | 118 +- docs/2019-03/index.html | 134 +- docs/2019-04/index.html | 100 +- docs/2019-05/index.html | 38 +- docs/2019-06/index.html | 20 +- docs/2019-07/index.html | 46 +- docs/2019-08/index.html | 38 +- docs/2019-09/index.html | 52 +- docs/2019-10/index.html | 38 +- docs/2019-11/index.html | 72 +- docs/2019-12/index.html | 30 +- docs/2020-01/index.html | 76 +- docs/404.html | 144 - docs/categories/index.html | 34 +- docs/categories/notes/index.html | 34 +- docs/categories/notes/index.xml | 46 +- docs/categories/notes/page/2/index.html | 44 +- docs/categories/notes/page/3/index.html | 44 +- docs/categories/page/2/index.html | 44 +- docs/categories/page/3/index.html | 44 +- docs/categories/page/4/index.html | 48 +- docs/categories/page/5/index.html | 38 +- docs/categories/page/6/index.html | 10 +- docs/cgiar-library-migration/index.html | 24 +- docs/cgspace-cgcorev2-migration/index.html | 10 +- ...c211ec94c36f7c4454ea15cf4d3548370042a.css} | 14 +- docs/fonts/FontAwesome.otf | Bin 134808 -> 0 bytes docs/fonts/fontawesome-webfont.eot | Bin 165742 -> 0 bytes docs/fonts/fontawesome-webfont.svg | 2671 ---------- docs/fonts/fontawesome-webfont.ttf | Bin 165548 -> 0 bytes docs/fonts/fontawesome-webfont.woff | Bin 98024 -> 0 bytes docs/fonts/fontawesome-webfont.woff2 | Bin 77160 -> 0 bytes docs/index.html | 34 +- docs/index.xml | 84 +- docs/page/2/index.html | 44 +- docs/page/3/index.html | 44 +- docs/page/4/index.html | 48 +- docs/page/5/index.html | 38 +- docs/page/6/index.html | 10 +- docs/posts/index.html | 34 +- docs/posts/index.xml | 84 +- docs/posts/page/2/index.html | 44 +- docs/posts/page/3/index.html | 44 +- docs/posts/page/4/index.html | 48 +- docs/posts/page/5/index.html | 38 +- docs/posts/page/6/index.html | 10 +- docs/tags/index.html | 34 +- docs/tags/migration/index.html | 12 +- docs/tags/notes/index.html | 48 +- docs/tags/notes/index.xml | 38 +- docs/tags/notes/page/2/index.html | 38 +- docs/tags/notes/page/3/index.html | 10 +- docs/tags/page/2/index.html | 44 +- docs/tags/page/3/index.html | 44 +- docs/tags/page/4/index.html | 48 
+- docs/tags/page/5/index.html | 38 +- docs/tags/page/6/index.html | 10 +- docs/webfonts/fa-brands-400.eot | Bin 0 -> 131930 bytes docs/webfonts/fa-brands-400.svg | 3535 +++++++++++++ docs/webfonts/fa-brands-400.ttf | Bin 0 -> 131624 bytes docs/webfonts/fa-brands-400.woff | Bin 0 -> 89100 bytes docs/webfonts/fa-brands-400.woff2 | Bin 0 -> 75936 bytes docs/webfonts/fa-regular-400.eot | Bin 0 -> 34390 bytes docs/webfonts/fa-regular-400.svg | 803 +++ docs/webfonts/fa-regular-400.ttf | Bin 0 -> 34092 bytes docs/webfonts/fa-regular-400.woff | Bin 0 -> 16800 bytes docs/webfonts/fa-regular-400.woff2 | Bin 0 -> 13576 bytes docs/webfonts/fa-solid-900.eot | Bin 0 -> 194066 bytes docs/webfonts/fa-solid-900.svg | 4700 +++++++++++++++++ docs/webfonts/fa-solid-900.ttf | Bin 0 -> 193780 bytes docs/webfonts/fa-solid-900.woff | Bin 0 -> 98996 bytes docs/webfonts/fa-solid-900.woff2 | Bin 0 -> 76084 bytes 112 files changed, 11466 insertions(+), 5158 deletions(-) delete mode 100644 docs/404.html rename docs/css/{style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css => style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css} (97%) delete mode 100644 docs/fonts/FontAwesome.otf delete mode 100644 docs/fonts/fontawesome-webfont.eot delete mode 100644 docs/fonts/fontawesome-webfont.svg delete mode 100644 docs/fonts/fontawesome-webfont.ttf delete mode 100644 docs/fonts/fontawesome-webfont.woff delete mode 100644 docs/fonts/fontawesome-webfont.woff2 create mode 100644 docs/webfonts/fa-brands-400.eot create mode 100644 docs/webfonts/fa-brands-400.svg create mode 100644 docs/webfonts/fa-brands-400.ttf create mode 100644 docs/webfonts/fa-brands-400.woff create mode 100644 docs/webfonts/fa-brands-400.woff2 create mode 100644 docs/webfonts/fa-regular-400.eot create mode 100644 docs/webfonts/fa-regular-400.svg create mode 100644 docs/webfonts/fa-regular-400.ttf create mode 100644 docs/webfonts/fa-regular-400.woff create mode 100644 docs/webfonts/fa-regular-400.woff2 create mode 100644 docs/webfonts/fa-solid-900.eot create mode 100644 docs/webfonts/fa-solid-900.svg create mode 100644 docs/webfonts/fa-solid-900.ttf create mode 100644 docs/webfonts/fa-solid-900.woff create mode 100644 docs/webfonts/fa-solid-900.woff2 diff --git a/content/posts/2020-01.md b/content/posts/2020-01.md index a76efdd54..2bfdfd1d1 100644 --- a/content/posts/2020-01.md +++ b/content/posts/2020-01.md @@ -264,4 +264,41 @@ $ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspa $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 ``` +## 2020-01-26 + +- Add "Gender" to controlled vocabulary for CRPs ([#442](https://github.com/ilri/DSpace/pull/442)) +- Deploy the changes on CGSpace and run all updates on the server and reboot it + - I had to restart the `tomcat7` service several times until all Solr statistics cores came up OK +- I spent a few hours writing a script ([create-thumbnails](https://gist.github.com/alanorth/1c7c8b2131a19559e273fbc1e58d6a71)) to compare the default DSpace thumbnails with the improved parameters above and actually when comparing them at size 600px I don't really notice much difference, other than the new ones have slightly crisper text + - So that was a waste of time, though I think our 300px thumbnails are a bit small now + - [Another thread on the ImageMagick forum](https://www.imagemagick.org/discourse-server/viewtopic.php?t=14561) mentions that you need to set the 
density, then read the image, then set the density again: + +``` +$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg +``` + +- One thing worth mentioning was this syntax for extracting bits from JSON in bash using `jq`: + +``` +$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams') +$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink' +"/bitstreams/172559/retrieve" +``` + +## 2020-01-27 + +- Bizu has been having problems when she logs into CGSpace, she can't see the community list on the front page + - This last happened for another user in [2016-11]({{< ref "2016-11.md" >}}), and it was related to the Tomcat `maxHttpHeaderSize` being too small because the user was in too many groups + - I see that it is similar, with this message appearing in the DSpace log just after she logs in: + +``` +2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses +``` + +- Now this appears to be a Solr limit of some kind ("too many boolean clauses") + - I changed the `maxBooleanClauses` for all Solr cores on DSpace Test from 1024 to 2048 and then she was able to see her communities... + - I made a [pull request](https://github.com/ilri/DSpace/pull/443) and merged it to the `5_x-prod` branch and will deploy on CGSpace later tonight + - I am curious if anyone on the dspace-tech mailing list has run into this, so I will try to send a message about this there when I get a chance + diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html index c9cfc9cd9..1aa3d7293 100644 --- a/docs/2015-11/index.html +++ b/docs/2015-11/index.html @@ -31,7 +31,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 "/> - + @@ -61,7 +61,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac - + @@ -109,7 +109,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac

November, 2015

by Alan Orth in -  + 

@@ -127,7 +127,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac

2015-11-24

$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
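
A more granular breakdown can show which database and application the idle connections belong to; a hedged sketch, assuming a PostgreSQL version new enough to expose the `state` and `application_name` columns:

```
$ psql -c "SELECT datname, application_name, count(*) FROM pg_stat_activity WHERE state = 'idle' GROUP BY datname, application_name ORDER BY 3 DESC;"
```
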
@@ -145,9 +145,9 @@ location ~ /(themes|static|aspects/ReportingSuite) {
     try_files $uri @tomcat;
 ...
 
@@ -157,7 +157,7 @@ location ~ /(themes|static|aspects/ReportingSuite) {
location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
 
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
@@ -173,7 +173,7 @@ datid | datname  |  pid  | usesysid | usename  | application_name | client_addr
 ...
 

CCAFS item

2015-12-03

2016-01-19

2016-01-21

diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html index 142e6cead..47cdbd618 100644 --- a/docs/2016-02/index.html +++ b/docs/2016-02/index.html @@ -35,7 +35,7 @@ I noticed we have a very interesting list of countries on CGSpace: Not only are there 49,000 countries, we have some blanks (25)… Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE” "/> - + @@ -65,7 +65,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r - + @@ -113,7 +113,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r

February, 2016

@@ -144,7 +144,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
 
dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
 DELETE 25
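
A quick sanity check that the blanks are really gone, reusing the same metadata field ID as above:

```
dspacetest=# select count(*) from metadatavalue where metadata_field_id=78 and text_value='';
```
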
@@ -157,7 +157,7 @@ DELETE 25
 
 

2016-02-07

    -
  • Working on cleaning up Abenet's DAGRIS data with OpenRefine
  • +
  • Working on cleaning up Abenet’s DAGRIS data with OpenRefine
  • I discovered two really nice functions in OpenRefine: value.trim() and value.escape("javascript") which shows whitespace characters like \r\n!
  • For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV
  • I re-import the resulting CSV and run a GREL on the date issued column: value.replace("\.0", "")
  • @@ -178,7 +178,7 @@ postgres=# \q $ vacuumdb dspacetest $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
 $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
@@ -199,7 +199,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
 

2016-02-08

ILRI submission buttons Drylands submission buttons

@@ -207,7 +207,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
$ cd ~/src/git
 $ git clone https://github.com/letsencrypt/letsencrypt
@@ -222,7 +222,7 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
 
  • I had to export some CIAT items that were being cleaned up on the test server and I noticed their dc.contributor.author fields have DSpace 5 authority index UUIDs…
  • To clean those up in OpenRefine I used this GREL expression: value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")
  • Getting more and more hangs on DSpace Test, seemingly random but also during CSV import
  • -
  • Logs don't always show anything right when it fails, but eventually one of these appears:
  • +
  • Logs don’t always show anything right when it fails, but eventually one of these appears:
  • org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
     
      @@ -230,7 +230,7 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
    Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
     
      -
    • Right now DSpace Test's Tomcat heap is set to 1536m and we have quite a bit of free RAM:
    • +
    • Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:
    # free -m
                  total       used       free     shared    buffers     cached
    @@ -238,7 +238,7 @@ Mem:          3950       3902         48          9         37       1311
     -/+ buffers/cache:       2552       1397
     Swap:          255         57        198
     
      -
    • So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
    • +
    • So I’ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
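
The heap bump itself would go into Tomcat's JAVA_OPTS; a minimal sketch, assuming the setting lives in something like /etc/default/tomcat7 on this server (the exact file is an assumption):

```
# hypothetical: raise the Tomcat JVM heap from 1536m to 2048m
JAVA_OPTS="-Xmx2048m -Dfile.encoding=UTF-8"
```
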

    2016-02-11

      @@ -259,16 +259,16 @@ Processing 64195.pdf > Creating thumbnail for 64195.pdf

    2016-02-12

    2016-02-12

    $ ls | grep -c -E "%"
    @@ -291,7 +291,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
     

    2016-02-20

    java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
     
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
    @@ -471,11 +471,11 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
     
    [dspace]/bin/dspace index-discovery
     
    • Now everything is ok
    • -
    • Finally finished manually running the cleanup task over and over and null'ing the conflicting IDs:
    • +
    • Finally finished manually running the cleanup task over and over and null’ing the conflicting IDs:
    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
     
      -
    • Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it's likely we haven't had a cleanup task complete successfully in years…
    • +
    • Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it’s likely we haven’t had a cleanup task complete successfully in years…
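
For reference, one iteration of that null-then-clean cycle looks roughly like this (a sketch; the bitstream ID is just the first one from the list above):

```
dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id=435;
$ [dspace]/bin/dspace cleanup -v
```
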

    2017-04-25

      @@ -548,7 +548,7 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac

      2017-04-26

      • The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though
      • -
      • Update RVM's Ruby from 2.3.0 to 2.4.0 on DSpace Test:
      • +
      • Update RVM’s Ruby from 2.3.0 to 2.4.0 on DSpace Test:
      $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
       $ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
      diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html
      index b1871605f..2b84b6759 100644
      --- a/docs/2017-05/index.html
      +++ b/docs/2017-05/index.html
      @@ -93,7 +93,7 @@
           

      May, 2017

      @@ -109,12 +109,12 @@

    2017-05-02

      -
    • Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request
    • +
    • Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request

    2017-05-04

    • Sync DSpace Test with database and assetstore from CGSpace
    • -
    • Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server
    • +
    • Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server
    • Now I can see the workflow statistics and am able to select users, but everything returns 0 items
• Megan says there are still some mapped items that are not appearing since last week, so I forced a full index-discovery -b
    • Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.cgiar.org/handle/10568/80731
    • @@ -149,8 +149,8 @@
    • We decided to use AIP export to preserve the hierarchies and handles of communities and collections
    • When ingesting some collections I was getting java.lang.OutOfMemoryError: GC overhead limit exceeded, which can be solved by disabling the GC timeout with -XX:-UseGCOverheadLimit
    • Other times I was getting an error about heap space, so I kept bumping the RAM allocation by 512MB each time (up to 4096m!) it crashed
    • -
    • This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you'll run out of disk space
    • -
    • In the end I realized it's better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
    • +
    • This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you’ll run out of disk space
    • +
    • In the end I realized it’s better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
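
The child AIPs can then be ingested one at a time in a loop; a rough sketch, with the flags assumed to mirror the community ingest above and the parent handle left as a placeholder:

```
$ for item in /home/aorth/10947-1/ITEM@10947-*; do
  [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p <collection-handle> "$item"
done
```
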
    @@ -162,14 +162,14 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
     
  • Give feedback to CIFOR about their data quality:
    • Suggestion: uppercase dc.subject, cg.coverage.region, and cg.coverage.subregion in your crosswalk so they match CGSpace and therefore can be faceted / reported on easier
    • -
    • Suggestion: use CGSpace's CRP names (cg.contributor.crp), see: dspace/config/input-forms.xml
    • +
    • Suggestion: use CGSpace’s CRP names (cg.contributor.crp), see: dspace/config/input-forms.xml
    • Suggestion: clean up duplicates and errors in funders, perhaps use a controlled vocabulary like ours, see: dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
    • Suggestion: use dc.type “Blog Post” instead of “Blog” for your blog post items (we are also adding a “Blog Post” type to CGSpace soon)
    • Question: many of your items use dc.document.uri AND cg.identifier.url with the same text value?
  • Help Marianne from WLE with an Open Search query to show the latest WLE CRP outputs: https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&sort_by=2&order=DESC
  • -
  • This uses the webui's item list sort options, see webui.itemlist.sort-option in dspace.cfg
  • +
  • This uses the webui’s item list sort options, see webui.itemlist.sort-option in dspace.cfg
  • The equivalent Discovery search would be: https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=WATER%2C+LAND+AND+ECOSYSTEMS&submit_apply_filter=&query=&rpp=10&sort_by=dc.date.issued_dt&order=desc
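
The same Open Search feed can also be pulled from the command line for testing, reusing the URL above:

```
$ curl -s 'https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&sort_by=2&order=DESC'
```
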
  • 2017-05-09

    @@ -191,7 +191,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager

    2017-05-10

      -
    • Atmire says they are willing to extend the ORCID implementation, and I've asked them to provide a quote
    • +
    • Atmire says they are willing to extend the ORCID implementation, and I’ve asked them to provide a quote
    • I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields
    • Finally finished importing all the CGIAR Library content, final method was:
    @@ -239,7 +239,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager

    Reboot DSpace Test

  • -

    Fix cron jobs for log management on DSpace Test, as they weren't catching dspace.log.* files correctly and we had over six months of them and they were taking up many gigs of disk space

    +

    Fix cron jobs for log management on DSpace Test, as they weren’t catching dspace.log.* files correctly and we had over six months of them and they were taking up many gigs of disk space

  • 2017-05-16

    @@ -253,7 +253,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
    ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.
     
      -
    • I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn't helped
    • +
    • I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn’t helped
    • It appears item with handle_id 84834 is one of the imported CGIAR Library items:
    dspace=# select * from handle where handle_id=84834;
    @@ -269,16 +269,16 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
          86873 | 10947/99 |                2 |       89153
     (1 row)
     
      -
    • I've posted on the dspace-test mailing list to see if I can just manually set the handle_seq to that value
    • +
• I’ve posted on the dspace-tech mailing list to see if I can just manually set the handle_seq to that value
    • Actually, it seems I can manually set the handle sequence using:
    dspace=# select setval('handle_seq',86873);
     
      -
    • After that I can create collections just fine, though I'm not sure if it has other side effects
    • +
    • After that I can create collections just fine, though I’m not sure if it has other side effects
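
A more general form of that fix (an assumption on my part, not what was actually run) is to reset the sequence from the highest existing handle_id:

```
dspace=# select setval('handle_seq', (select max(handle_id) from handle));
```
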

    2017-05-21

      -
    • Start creating a basic theme for the CGIAR System Organization's community on CGSpace
    • +
    • Start creating a basic theme for the CGIAR System Organization’s community on CGSpace
    • Using colors from the CGIAR Branding guidelines (2014)
    • Make a GitHub issue to track this work: #324
    @@ -315,14 +315,14 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (

    2017-05-23

    • Add Affiliation to filters on Listing and Reports module (#325)
    • -
    • Start looking at WLE's Phase II metadata updates but it seems they are not tagging their items properly, as their website importer infers which theme to use based on the name of the CGSpace collection!
    • -
    • For now I've suggested that they just change the collection names and that we fix their metadata manually afterwards
    • +
    • Start looking at WLE’s Phase II metadata updates but it seems they are not tagging their items properly, as their website importer infers which theme to use based on the name of the CGSpace collection!
    • +
    • For now I’ve suggested that they just change the collection names and that we fix their metadata manually afterwards
    • Also, they have a lot of messed up values in their cg.subject.wle field so I will clean up some of those first:
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
     COPY 111
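
Once the distinct values have been reviewed, the corrections could be applied with the same fix-metadata-values.py helper used elsewhere in these notes; a sketch, with a hypothetical corrections CSV:

```
$ ./fix-metadata-values.py -i /tmp/2017-05-23-fix-wle-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.wle -m 119
```
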
     
      -
    • Respond to Atmire message about ORCIDs, saying that right now we'd prefer to just have them available via REST API like any other metadata field, and that I'm available for a Skype
    • +
    • Respond to Atmire message about ORCIDs, saying that right now we’d prefer to just have them available via REST API like any other metadata field, and that I’m available for a Skype

    2017-05-26

      @@ -334,7 +334,7 @@ COPY 111
    • File an issue on GitHub to explore/track migration to proper country/region codes (ISO 2/3 and UN M.49): #326
    • Ask Peter how the Landportal.info people should acknowledge us as the source of data on their website
    • Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the June, 2017 DCAT meeting
    • -
    • Find all of Amos Omore's author name variations so I can link them to his authority entry that has an ORCID:
    • +
    • Find all of Amos Omore’s author name variations so I can link them to his authority entry that has an ORCID:
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
     
      @@ -347,7 +347,7 @@ UPDATE 187
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
     
      -
    • But it doesn't look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
    • +
    • But it doesn’t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
    • Now I should be able to set his name variations to the new authority:
    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
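
To confirm the update took effect, one could simply re-run the earlier select (the same query as above):

```
dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
```
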
    @@ -359,7 +359,7 @@ UPDATE 187
     
    • Discuss WLE themes and subjects with Mia and Macaroni Bros
    • We decided we need to create metadata fields for Phase I and II themes
    • -
    • I've updated the existing GitHub issue for Phase II (#322) and created a new one to track the changes for Phase I themes (#327)
    • +
    • I’ve updated the existing GitHub issue for Phase II (#322) and created a new one to track the changes for Phase I themes (#327)
    • After Macaroni Bros update the WLE website importer we will rename the WLE collections to reflect Phase II
    • Also, we need to have Mia and Udana look through the existing metadata in cg.subject.wle as it is quite a mess
    diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html index bad2692d0..4380d0109 100644 --- a/docs/2017-06/index.html +++ b/docs/2017-06/index.html @@ -6,7 +6,7 @@ - + @@ -14,8 +14,8 @@ - - + + @@ -45,7 +45,7 @@ - + @@ -93,7 +93,7 @@

    June, 2017

    @@ -101,7 +101,7 @@
    • After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes
    • The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes
    • -
    • Then we'll create a new sub-community for Phase II and create collections for the research themes there
    • +
    • Then we’ll create a new sub-community for Phase II and create collections for the research themes there
    • The current “Research Themes” community will be renamed to “WLE Phase I Research Themes”
    • Tagged all items in the current Phase I collections with their appropriate themes
    • Create pull request to add Phase II research themes to the submission form: #328
    • @@ -111,15 +111,15 @@
      • After adding cg.identifier.wletheme to 1106 WLE items I can see the field on XMLUI but not in REST!
      • Strangely it happens on DSpace Test AND on CGSpace!
      • -
      • I tried to re-index Discovery but it didn't fix it
      • +
      • I tried to re-index Discovery but it didn’t fix it
      • Run all system updates on DSpace Test and reboot the server
      • After rebooting the server (and therefore restarting Tomcat) the new metadata field is available
      • -
      • I've sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket
      • +
      • I’ve sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket

2017-06-05

        -
      • Rename WLE's “Research Themes” sub-community to “WLE Phase I Research Themes” on DSpace Test so Macaroni Bros can continue their testing
      • -
      • Macaroni Bros tested it and said it's fine, so I renamed it on CGSpace as well
      • +
      • Rename WLE’s “Research Themes” sub-community to “WLE Phase I Research Themes” on DSpace Test so Macaroni Bros can continue their testing
      • +
      • Macaroni Bros tested it and said it’s fine, so I renamed it on CGSpace as well
      • Working on how to automate the extraction of the CIAT Book chapters, doing some magic in OpenRefine to extract page from–to from cg.identifier.url and dc.format.extent, respectively:
        • cg.identifier.url: value.split("page=", "")[1]
        • @@ -144,7 +144,7 @@
      • 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
      • -
      • I've flagged them and proceeded without them (752 total) on DSpace Test:
      • +
      • I’ve flagged them and proceeded without them (752 total) on DSpace Test:
      $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
       
        @@ -154,9 +154,9 @@

      2017-06-07

        -
      • Testing Atmire's patch for the CUA Workflow Statistics again
      • -
      • Still doesn't seem to give results I'd expect, like there are no results for Maria Garruccio, or for the ILRI community!
      • -
      • Then I'll file an update to the issue on Atmire's tracker
      • +
      • Testing Atmire’s patch for the CUA Workflow Statistics again
      • +
      • Still doesn’t seem to give results I’d expect, like there are no results for Maria Garruccio, or for the ILRI community!
      • +
      • Then I’ll file an update to the issue on Atmire’s tracker
      • Created a new branch with just the relevant changes, so I can send it to them
      • One thing I noticed is that there is a failed database migration related to CUA:
      @@ -194,7 +194,7 @@

    2017-06-20

      -
    • Import Abenet and Peter's changes to the CGIAR Library CRP community
    • +
    • Import Abenet and Peter’s changes to the CGIAR Library CRP community
    • Due to them using Windows and renaming some columns there were formatting, encoding, and duplicate metadata value issues
    • I had to remove some fields from the CSV and rename some back to, ie, dc.subject[en_US] just so DSpace would detect changes properly
    • Now it looks much better: https://dspacetest.cgiar.org/handle/10947/2517
    • @@ -212,7 +212,7 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace impo
      • WLE has said that one of their Phase II research themes is being renamed from Regenerating Degraded Landscapes to Restoring Degraded Landscapes
      • Pull request with the changes to input-forms.xml: #329
      • -
      • As of now it doesn't look like there are any items using this research theme so we don't need to do any updates:
      • +
      • As of now it doesn’t look like there are any items using this research theme so we don’t need to do any updates:
      dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
        text_value
      @@ -229,15 +229,15 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace impo
       
      Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
       
      • After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load
      • -
      • Might be a good time to adjust DSpace's database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
      • -
• I've adjusted the following in CGSpace's config:
+
      • Might be a good time to adjust DSpace’s database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
      • +
      • I’ve adjusted the following in CGSpace’s config:
          -
        • db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace's default of 30 is quite low)
        • +
        • db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace’s default of 30 is quite low)
        • db.maxwait 5000→10000
        • db.maxidle 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)
      • -
      • We will need to adjust this again (as well as the pg_hba.conf settings) when we deploy tsega's REST API
      • +
      • We will need to adjust this again (as well as the pg_hba.conf settings) when we deploy tsega’s REST API
      • Whip up a test for Marianne of WLE to be able to show both their Phase I and II research themes in the CGSpace item submission form:

      Test A for displaying the Phase I and II research themes diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html index 752ad42e3..9704ee0b4 100644 --- a/docs/2017-07/index.html +++ b/docs/2017-07/index.html @@ -13,8 +13,8 @@ Run system updates and reboot DSpace Test 2017-07-04 Merge changes for WLE Phase II theme rename (#329) -Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace -We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML: +Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace +We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML: " /> @@ -30,10 +30,10 @@ Run system updates and reboot DSpace Test 2017-07-04 Merge changes for WLE Phase II theme rename (#329) -Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace -We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML: +Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace +We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML: "/> - + @@ -63,7 +63,7 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o - + @@ -111,7 +111,7 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o

      July, 2017

      @@ -122,19 +122,19 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o

      2017-07-04

      • Merge changes for WLE Phase II theme rename (#329)
      • -
      • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
      • -
      • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
      • +
      • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
      • +
      • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
      $ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
       
      • The sed script is from a post on the PostgreSQL mailing list
      • -
      • Abenet says the ILRI board wants to be able to have “lead author” for every item, so I've whipped up a WIP test in the 5_x-lead-author branch
      • -
      • It works but is still very rough and we haven't thought out the whole lifecycle yet
      • +
      • Abenet says the ILRI board wants to be able to have “lead author” for every item, so I’ve whipped up a WIP test in the 5_x-lead-author branch
      • +
      • It works but is still very rough and we haven’t thought out the whole lifecycle yet

      Testing lead author in submission form

      • I assume that “lead author” would actually be the first question on the item submission form
      • -
      • We also need to check to see which ORCID authority core this uses, because it seems to be using an entirely new one rather than the one for dc.contributor.author (which makes sense of course, but fuck, all the author problems aren't bad enough?!)
      • +
      • We also need to check to see which ORCID authority core this uses, because it seems to be using an entirely new one rather than the one for dc.contributor.author (which makes sense of course, but fuck, all the author problems aren’t bad enough?!)
      • Also would need to edit XMLUI item displays to incorporate this into authors list
      • And fuck, then anyone consuming our data via REST / OAI will not notice that we have an author outside of dc.contributor.authors… ugh
      • What if we modify the item submission form to use type-bind fields to show/hide certain fields depending on the type?
      • @@ -152,8 +152,8 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
      • Looking at the pg_stat_activity table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense
      • -
      • Tsega restarted Tomcat and it's working now
      • -
      • Abenet said she was generating a report with Atmire's CUA module, so it could be due to that?
      • +
      • Tsega restarted Tomcat and it’s working now
      • +
      • Abenet said she was generating a report with Atmire’s CUA module, so it could be due to that?
      • Looking in the logs I see this random error again that I should report to DSpace:
      2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
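
To gauge how often that SolrLogger error actually occurs, one could count it in the day's log (a sketch; the path follows the usual [dspace]/log layout):

```
$ grep -c 'COUNTRY ERROR' [dspace]/log/dspace.log.2017-07-05
```
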
      @@ -171,7 +171,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
       

    2017-07-14

      -
    • Sisay sent me a patch to add “Photo Report” to dc.type so I've added it to the 5_x-prod branch
    • +
    • Sisay sent me a patch to add “Photo Report” to dc.type so I’ve added it to the 5_x-prod branch

    2017-07-17

      @@ -193,7 +193,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
• Talk to Tsega and Danny about exporting/ingesting the blog posts from Drupal into DSpace?
    • Followup meeting on August 8/9?
    • -
• Sent Abenet the 2415 records from CGIAR Library's Historical Archive (10947/1) after cleaning up the author authorities and HTML entities in dc.contributor.author and dc.description.abstract using OpenRefine:
+
    • Sent Abenet the 2415 records from CGIAR Library’s Historical Archive (10947/1) after cleaning up the author authorities and HTML entities in dc.contributor.author and dc.description.abstract using OpenRefine:
      • Authors: value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")
      • Abstracts: replace(value,/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')
      • @@ -210,10 +210,10 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve

      2017-07-27

        -
      • Help Sisay with some transforms to add descriptions to the filename column of some CIAT Presentations he's working on in OpenRefine
      • +
      • Help Sisay with some transforms to add descriptions to the filename column of some CIAT Presentations he’s working on in OpenRefine
      • Marianne emailed a few days ago to ask why “Integrating Ecosystem Solutions” was not in the list of WLE Phase I Research Themes on the input form
      • I told her that I only added the themes that I saw in the WLE Phase I Research Themes community
      • -
      • Then Mia from WLE also emailed to ask where some WLE focal regions went, and I said I didn't understand what she was talking about, as all we did in our previous work was rename the old “Research Themes” subcommunity to “WLE Phase I Research Themes” and add a new subcommunity for “WLE Phase II Research Themes”.
      • +
      • Then Mia from WLE also emailed to ask where some WLE focal regions went, and I said I didn’t understand what she was talking about, as all we did in our previous work was rename the old “Research Themes” subcommunity to “WLE Phase I Research Themes” and add a new subcommunity for “WLE Phase II Research Themes”.
      • Discuss some modifications to the CCAFS project tags in CGSpace submission form and in the database

      2017-07-28

      @@ -228,7 +228,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve

      2017-07-30

      • Start working on CCAFS project tag cleanup
      • -
      • More questions about inconsistencies and spelling mistakes in their tags, so I've sent some questions for followup
      • +
      • More questions about inconsistencies and spelling mistakes in their tags, so I’ve sent some questions for followup

      2017-07-31

        diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html index 6fb0bb62e..536c124b9 100644 --- a/docs/2017-08/index.html +++ b/docs/2017-08/index.html @@ -20,7 +20,7 @@ But many of the bots are browsing dynamic URLs like: The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these! Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962 -It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it! +It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it! Also, the bot has to successfully browse the page first so it can receive the HTTP header… We might actually have to block these requests with HTTP 403 depending on the user agent Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415 @@ -49,7 +49,7 @@ But many of the bots are browsing dynamic URLs like: The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these! Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962 -It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it! +It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it! Also, the bot has to successfully browse the page first so it can receive the HTTP header… We might actually have to block these requests with HTTP 403 depending on the user agent Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415 @@ -57,7 +57,7 @@ This was due to newline characters in the dc.description.abstract column, which I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet "/> - + @@ -87,7 +87,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s - + @@ -135,7 +135,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s

        August, 2017

        @@ -153,7 +153,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
      • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
      • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
      • -
      • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
      • +
      • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
      • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
      • We might actually have to block these requests with HTTP 403 depending on the user agent
      • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
      • @@ -164,9 +164,9 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s

        2017-08-02

        • Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)
        • -
        • I think Atmire's Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can't figure it out
        • -
        • I had a look at the moduel configuration and couldn't figure out a way to do this, so I opened a ticket on the Atmire tracker
        • -
        • Atmire responded about the missing workflow statistics issue a few weeks ago but I didn't see it for some reason
        • +
        • I think Atmire’s Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can’t figure it out
        • +
• I had a look at the module configuration and couldn’t figure out a way to do this, so I opened a ticket on the Atmire tracker
        • +
        • Atmire responded about the missing workflow statistics issue a few weeks ago but I didn’t see it for some reason
        • They said they added a publication and saw the workflow stat for the user, so I should try again and let them know

        2017-08-05

        @@ -176,17 +176,17 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s

      CIFOR OAI harvesting

        -
      • I don't see anything related in our logs, so I asked him to check for our server's IP in their logs
      • -
      • Also, in the mean time I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn't reset the collection, just the harvester status!)
      • +
      • I don’t see anything related in our logs, so I asked him to check for our server’s IP in their logs
      • +
• Also, in the meantime I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn’t reset the collection, just the harvester status!)

      2017-08-07

        -
      • Apply Abenet's corrections for the CGIAR Library's Consortium subcommunity (697 records)
      • -
      • I had to fix a few small things, like moving the dc.title column away from the beginning of the row, delete blank spaces in the abstract in vim using :g/^$/d, add the dc.subject[en_US] column back, as she had deleted it and DSpace didn't detect the changes made there (we needed to blank the values instead)
      • +
      • Apply Abenet’s corrections for the CGIAR Library’s Consortium subcommunity (697 records)
      • +
      • I had to fix a few small things, like moving the dc.title column away from the beginning of the row, delete blank spaces in the abstract in vim using :g/^$/d, add the dc.subject[en_US] column back, as she had deleted it and DSpace didn’t detect the changes made there (we needed to blank the values instead)

      2017-08-08

        -
      • Apply Abenet's corrections for the CGIAR Library's historic archive subcommunity (2415 records)
      • +
      • Apply Abenet’s corrections for the CGIAR Library’s historic archive subcommunity (2415 records)
      • I had to add the dc.subject[en_US] column back with blank values so that DSpace could detect the changes
      • I applied the changes in 500 item batches
      @@ -196,13 +196,13 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
    • Help ICARDA upgrade their MELSpace to DSpace 5.7 using the docker-dspace container
      • We had to import the PostgreSQL dump to the PostgreSQL container using: pg_restore -U postgres -d dspace blah.dump
      • -
      • Otherwise, when using -O it messes up the permissions on the schema and DSpace can't read it
      • +
      • Otherwise, when using -O it messes up the permissions on the schema and DSpace can’t read it

    2017-08-10

      -
    • Apply last updates to the CGIAR Library's Fund community (812 items)
    • +
    • Apply last updates to the CGIAR Library’s Fund community (812 items)
    • Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.
    • Also I applied the HTML entities unescape transform on the abstract column in Open Refine
    • I need to get an author list from the database for only the CGIAR Library community to send to Peter
    • @@ -243,7 +243,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s 85736 70.32.83.92
    • The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead
    • -
    • I've enabled logging of /oai requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)
    • +
    • I’ve enabled logging of /oai requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)
        # log oai requests
         location /oai {
    @@ -268,7 +268,7 @@ DELETE 1
     dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
     
    • Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done
    • -
    • Thinking about resource limits for PostgreSQL again after last week's CGSpace crash and related to a recently discussion I had in the comments of the April, 2017 DCAT meeting notes
    • +
• Thinking about resource limits for PostgreSQL again after last week’s CGSpace crash, and related to a recent discussion I had in the comments of the April, 2017 DCAT meeting notes
    • In that thread Chris Wilper suggests a new default of 35 max connections for db.maxconnections (from the current default of 30), knowing that each DSpace web application gets to use up to this many on its own
    • It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:
    @@ -283,21 +283,21 @@ $ grep -rsI SQLException dspace-solr | wc -l $ grep -rsI SQLException dspace-xmlui | wc -l 866
      -
    • Of those five applications we're running, only solr appears not to use the database directly
    • -
    • And JSPUI is only used internally (so it doesn't really count), leaving us with OAI, REST, and XMLUI
    • -
    • Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL's default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see superuser_reserved_connections)
    • -
    • So we should adjust PostgreSQL's max connections to be DSpace's db.maxconnections * 3 + 3
    • -
    • This would allow each application to use up to db.maxconnections and not to go over the system's PostgreSQL limit
    • +
    • Of those five applications we’re running, only solr appears not to use the database directly
    • +
    • And JSPUI is only used internally (so it doesn’t really count), leaving us with OAI, REST, and XMLUI
    • +
    • Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL’s default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see superuser_reserved_connections)
    • +
    • So we should adjust PostgreSQL’s max connections to be DSpace’s db.maxconnections * 3 + 3
    • +
    • This would allow each application to use up to db.maxconnections and not to go over the system’s PostgreSQL limit
    • Perhaps since CGSpace is a busy site with lots of resources we could actually use something like 40 for db.maxconnections
    • -
    • Also worth looking into is to set up a database pool using JNDI, as apparently DSpace's db.poolname hasn't been used since around DSpace 1.7 (according to Chris Wilper's comments in the thread)
    • +
    • Also worth looking into is to set up a database pool using JNDI, as apparently DSpace’s db.poolname hasn’t been used since around DSpace 1.7 (according to Chris Wilper’s comments in the thread)
    • Need to go check the PostgreSQL connection stats in Munin on CGSpace from the past week to get an idea if 40 is appropriate
    • Looks like connections hover around 50:

    PostgreSQL connections 2017-08

      -
    • Unfortunately I don't have the breakdown of which DSpace apps are making those connections (I'll assume XMLUI)
    • -
    • So I guess a limit of 30 (DSpace default) is too low, but 70 causes problems when the load increases and the system's PostgreSQL max_connections is too low
    • -
    • For now I think maybe setting DSpace's db.maxconnections to 40 and adjusting the system's max_connections might be a good starting point: 40 * 3 + 3 = 123
    • +
    • Unfortunately I don’t have the breakdown of which DSpace apps are making those connections (I’ll assume XMLUI)
    • +
    • So I guess a limit of 30 (DSpace default) is too low, but 70 causes problems when the load increases and the system’s PostgreSQL max_connections is too low
    • +
    • For now I think maybe setting DSpace’s db.maxconnections to 40 and adjusting the system’s max_connections might be a good starting point: 40 * 3 + 3 = 123
    • Apply 223 more author corrections from Peter on CGIAR Library
    • Help Magdalena from CCAFS with some CUA statistics questions
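    • Coming back to the connection-limit arithmetic above, a quick sanity check of the formula and of what PostgreSQL currently allows (a sketch; assumes shell access to the postgres user via sudo):
    # three database-using webapps * db.maxconnections + a few reserved superuser connections
    $ echo $((40 * 3 + 3))
    123
    $ sudo -u postgres psql -c 'SHOW max_connections;'
    $ sudo -u postgres psql -c 'SHOW superuser_reserved_connections;'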
    @@ -320,7 +320,7 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
    dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
     
    • And on others like dc.language.iso, dc.relation.ispartofseries, dc.type, dc.title, etc…
    • -
    • Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don't use the Dublin Core one for some reason):
    • +
    • Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don’t use the Dublin Core one for some reason):
    dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
     UPDATE 15
    @@ -339,8 +339,8 @@ UPDATE 4899
     
     
    isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
     
      -
    • This would be true if the authors were like CGIAR System Management Office||CGIAR System Management Office, which some of the CGIAR Library's were
    • -
    • Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn't detect any changes, so you have to edit them all manually via DSpace's “Edit Item”
    • +
    • This would be true if the authors were like CGIAR System Management Office||CGIAR System Management Office, which some of the CGIAR Library’s were
    • +
    • Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn’t detect any changes, so you have to edit them all manually via DSpace’s “Edit Item”
    • Ooh! And an even more interesting regex would match any duplicated author:
    isNotNull(value.match(/(.+?)\|\|\1/))
    @@ -354,7 +354,7 @@ UPDATE 4899
     
     

    2017-08-17

      -
    • Run Peter's edits to the CGIAR System Organization community on DSpace Test
    • +
    • Run Peter’s edits to the CGIAR System Organization community on DSpace Test
    • Uptime Robot said CGSpace went down for 1 minute, not sure why
    • Looking in dspace.log.2017-08-17 I see some weird errors that might be related?
    @@ -386,7 +386,7 @@ dspace.log.2017-08-17:584
  • A few posts on the dspace-tech mailing list say this is related to the Cocoon cache somehow
  • I will clear the XMLUI cache for now and see if the errors continue (though perhaps shutting down Tomcat and removing the cache is more effective somehow?)
  • We tested the option for limiting restricted items from the RSS feeds on DSpace Test
  • -
  • I created four items, and only the two with public metadata showed up in the community's RSS feed:
  • +
  • I created four items, and only the two with public metadata showed up in the community’s RSS feed:
    • Public metadata, public bitstream ✓
    • Public metadata, restricted bitstream ✓
    • @@ -394,7 +394,7 @@ dspace.log.2017-08-17:584
    • Private item ✗
  • -
  • Peter responded and said that he doesn't want to limit items to be restricted just so we can change the RSS feeds
  • +
  • Peter responded and said that he doesn’t want to limit items to be restricted just so we can change the RSS feeds
  • 2017-08-18

    $ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
     sparql$ PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    @@ -442,7 +442,7 @@ WHERE {
     
     

    2017-08-20

      -
    • Since I cleared the XMLUI cache on 2017-08-17 there haven't been any more ERROR net.sf.ehcache.store.DiskStore errors
    • +
    • Since I cleared the XMLUI cache on 2017-08-17 there haven’t been any more ERROR net.sf.ehcache.store.DiskStore errors
    • Look at the CGIAR Library to see if I can find the items that have been submitted since May:
    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
    @@ -474,13 +474,13 @@ WHERE {
     

    2017-08-28

    • Bram had written to me two weeks ago to set up a chat about ORCID stuff but the email apparently bounced and I only found out when he emailed me on another account
    • -
    • I told him I can chat in a few weeks when I'm back
    • +
    • I told him I can chat in a few weeks when I’m back

    2017-08-31

    • I notice that in many WLE collections Marianne Gadeberg is in the edit or approval steps, but she is also in the groups for those steps.
    • I think we need to have a process to go back and check / fix some of these scenarios—to remove her user from the step and instead add her to the group—because we have way too many authorizations and in late 2016 we had performance issues with Solr because of this
    • -
    • I asked Sisay about this and hinted that he should go back and fix these things, but let's see what he says
    • +
    • I asked Sisay about this and hinted that he should go back and fix these things, but let’s see what he says
    • Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:
    ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
    @@ -488,7 +488,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
     
    • Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08
    • It seems that I changed the db.maxconnections setting from 70 to 40 around 2017-08-14, but Macaroni Bros also reduced their hourly hammering of the REST API then
    • -
    • Nevertheless, it seems like a connection limit is not enough and that I should increase it (as well as the system's PostgreSQL max_connections)
    • +
    • Nevertheless, it seems like a connection limit is not enough and that I should increase it (as well as the system’s PostgreSQL max_connections)
    diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html
    index 315b5b58b..a6d29bbfe 100644
    --- a/docs/2017-09/index.html
    +++ b/docs/2017-09/index.html
    @@ -12,7 +12,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
    2017-09-07
    -Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    +Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
    " />
    @@ -27,9 +27,9 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
    2017-09-07
    -Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    +Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
    "/>

    September, 2017

    @@ -117,7 +117,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is

    2017-09-07

      -
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
    • +
    • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group

    2017-09-10

      @@ -126,17 +126,17 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is
      dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
       DELETE 58
       
        -
      • I also ran it on DSpace Test because we'll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
      • +
      • I also ran it on DSpace Test because we’ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
      • Run system updates and restart DSpace Test
      • We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)
      • -
      • I still have the original data from the CGIAR Library so I've zipped it up and sent it off to linode18 for now
      • +
      • I still have the original data from the CGIAR Library so I’ve zipped it up and sent it off to linode18 for now
      • sha256sum of original-cgiar-library-6.6GB.tar.gz is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a
      • Start doing a test run of the CGIAR Library migration locally
      • Notes and todo checklist here for now: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c
      • Create pull request for Phase I and II changes to CCAFS Project Tags: #336
      • -
      • We've been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
      • -
      • There will need to be some metadata updates — though if I recall correctly it is only about seven records — for that as well, I had made some notes about it in 2017-07, but I've asked for more clarification from Lili just in case
      • -
      • Looking at the DSpace logs to see if we've had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
      • +
      • We’ve been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
      • +
      • There will need to be some metadata updates — though if I recall correctly it is only about seven records — for that as well, I had made some notes about it in 2017-07, but I’ve asked for more clarification from Lili just in case
      • +
      • Looking at the DSpace logs to see if we’ve had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
      # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
       dspace.log.2017-09-01:0
      @@ -150,11 +150,11 @@ dspace.log.2017-09-08:10
       dspace.log.2017-09-09:0
       dspace.log.2017-09-10:0
       
        -
      • Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I'm sure that helped
      • +
      • Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I’m sure that helped
      • There are still some errors, though, so maybe I should bump the connection limit up a bit
      • -
      • I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we're currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system's PostgreSQL max_connections (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
      • +
      • I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we’re currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system’s PostgreSQL max_connections (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
      • I updated both CGSpace and DSpace Test to use these new settings (60 connections per web app and 183 for system PostgreSQL limit)
      • -
      • I'm expecting to see 0 connection errors for the next few months
      • +
      • I’m expecting to see 0 connection errors for the next few months
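      • A quick way to see how those connections are actually being used, grouped by database and user (a sketch using pg_stat_activity):
      $ sudo -u postgres psql -c "SELECT datname, usename, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC;"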

      2017-09-11

        @@ -163,7 +163,7 @@ dspace.log.2017-09-10:0

      2017-09-12

        -
      • I was testing the METS XSD caching during AIP ingest but it doesn't seem to help actually
      • +
      • I was testing the METS XSD caching during AIP ingest but it doesn’t seem to help actually
      • The import process takes the same amount of time with and without the caching
      • Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):
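      • For reference, a capture like that could be done along these lines (a sketch; the interface name and output path are assumptions):
      # run this in a second terminal while the AIP import is going, then stop it with Ctrl-C
      $ sudo tcpdump -i eth0 -nn 'tcp dst port 80' -w /tmp/aip-import.pcap
      # afterwards, count the captured packets
      $ tcpdump -nn -r /tmp/aip-import.pcap | wc -l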
      @@ -182,8 +182,8 @@ dspace.log.2017-09-10:0
    • I had a Skype call with Bram Luyten from Atmire to discuss various issues related to ORCID in DSpace
      • First, ORCID is deprecating their version 1 API (which DSpace uses) and in version 2 API they have removed the ability to search for users by name
      • -
      • The logic is that searching by name actually isn't very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names
      • -
      • Atmire's proposed integration would work by having users lookup and add authors to the authority core directly using their ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)
      • +
      • The logic is that searching by name actually isn’t very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names
      • +
      • Atmire’s proposed integration would work by having users lookup and add authors to the authority core directly using their ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)
      • Once the association between name and ORCID is made in the authority then it can be autocompleted in the lookup field
      • Ideally there could also be a user interface for cleanup and merging of authorities
      • He will prepare a quote for us with keeping in mind that this could be useful to contribute back to the community for a 5.x release
      • @@ -194,8 +194,8 @@ dspace.log.2017-09-10:0

        2017-09-13

        • Last night Linode sent an alert about CGSpace (linode18) that it has exceeded the outbound traffic rate threshold of 10Mb/s for the last two hours
        • -
        • I wonder what was going on, and looking into the nginx logs I think maybe it's OAI…
        • -
        • Here is yesterday's top ten IP addresses making requests to /oai:
        • +
        • I wonder what was going on, and looking into the nginx logs I think maybe it’s OAI…
        • +
        • Here is yesterday’s top ten IP addresses making requests to /oai:
        # awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
               1 213.136.89.78
        @@ -208,7 +208,7 @@ dspace.log.2017-09-10:0
           15825 35.161.215.53
           16704 54.70.51.7
         
          -
        • Compared to the previous day's logs it looks VERY high:
        • +
        • Compared to the previous day’s logs it looks VERY high:
        # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
               1 207.46.13.39
        @@ -260,7 +260,7 @@ dspace.log.2017-09-10:0
         /var/log/nginx/oai.log.8.gz:0
         /var/log/nginx/oai.log.9.gz:0
         
          -
        • Some of these heavy users are also using XMLUI, and their user agent isn't matched by the Tomcat Session Crawler valve, so each request uses a different session
        • +
        • Some of these heavy users are also using XMLUI, and their user agent isn’t matched by the Tomcat Session Crawler valve, so each request uses a different session
        • Yesterday alone the IP addresses using the API scraper user agent were responsible for 16,000 sessions in XMLUI:
        # grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
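        • To see exactly which user agent string those IPs send (so it could be added to the crawler valve’s regex), something like this works against the nginx logs (a sketch; assumes the default combined log format, where the user agent is the sixth double-quoted field):
        # grep -E '^(54\.70\.51\.7|35\.161\.215\.53|34\.211\.17\.113|54\.70\.175\.86) ' /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn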
        @@ -273,7 +273,7 @@ dspace.log.2017-09-10:0
         
        WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
         
        • Looking at the spreadsheet with deletions and corrections that CCAFS sent last week
        • -
        • It appears they want to delete a lot of metadata, which I'm not sure they realize the implications of:
        • +
        • It appears they want to delete a lot of metadata, which I’m not sure they realize the implications of:
        dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;                                                                                                                                                                                                                  
                 text_value        | count                              
        @@ -300,12 +300,12 @@ dspace.log.2017-09-10:0
         (19 rows)
         
        • I sent CCAFS people an email to ask if they really want to remove these 200+ tags
        • -
        • She responded yes, so I'll at least need to do these deletes in PostgreSQL:
        • +
        • She responded yes, so I’ll at least need to do these deletes in PostgreSQL:
        dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
         DELETE 207
         
          -
        • When we discussed this in late July there were some other renames they had requested, but I don't see them in the current spreadsheet so I will have to follow that up
        • +
        • When we discussed this in late July there were some other renames they had requested, but I don’t see them in the current spreadsheet so I will have to follow that up
        • I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet had evolved organically rather than systematically!
        • The final list of corrections and deletes should therefore be:
        @@ -319,7 +319,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
      • Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): https://jira.duraspace.org/browse/DS-1492
      • I commented there suggesting that we disable it globally
      • I merged the changes to the CCAFS project tags (#336) but still need to finalize the metadata deletions/renames
      • -
      • I merged the CGIAR Library theme changes (#338) to the 5_x-prod branch in preparation for next week's migration
      • +
      • I merged the CGIAR Library theme changes (#338) to the 5_x-prod branch in preparation for next week’s migration
  • I emailed the Handle administrators (hdladmin@cnri.reston.va.us) to ask them what the process is for changing their prefix to be resolved by our resolver
      • They responded and said that they need email confirmation from the contact of record of the other prefix, so I should have the CGIAR System Organization people email them before I send the new sitebndl.zip
      • Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database
      • @@ -354,7 +354,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134 Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600 (9 rows)
          -
        • It created a new authority… let's try to add another item and select the same existing author and see what happens in the database:
        • +
        • It created a new authority… let’s try to add another item and select the same existing author and see what happens in the database:
        dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
          text_value |              authority               | confidence 
        @@ -387,7 +387,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
          Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
         (10 rows)
         
          -
        • Shit, it created another authority! Let's try it again!
        • +
        • Shit, it created another authority! Let’s try it again!
        dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';                                                                                             
          text_value |              authority               | confidence
        @@ -413,7 +413,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
         
    • Michael Marus is the contact for their prefix, but he has left CGIAR; since I actually have access to the CGIAR Library server I think I can just generate a new sitebndl.zip file from their server and send it to Handle.net
      • Also, Handle.net says their prefix is up for annual renewal next month so we might want to just pay for it and take it over
      • CGSpace was very slow and Uptime Robot even said it was down at one time
      • -
      • I didn't see any abnormally high usage in the REST or OAI logs, but looking at Munin I see the average JVM usage was at 4.9GB and the heap is only 5GB (5120M), so I think it's just normal growing pains
      • +
      • I didn’t see any abnormally high usage in the REST or OAI logs, but looking at Munin I see the average JVM usage was at 4.9GB and the heap is only 5GB (5120M), so I think it’s just normal growing pains
      • Every few months I generally try to increase the JVM heap to be 512M higher than the average usage reported by Munin, so now I adjusted it to 5632M
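      • For the record, a minimal sketch of what that looks like if the heap is set via JAVA_OPTS in Tomcat’s defaults file (the path is illustrative for a stock Ubuntu tomcat7 package; the real setting lives wherever Tomcat’s options are managed on CGSpace):
      # /etc/default/tomcat7 (illustrative location)
      JAVA_OPTS="$JAVA_OPTS -Xms5632m -Xmx5632m"
      # restart Tomcat to pick up the new heap size
      $ sudo systemctl restart tomcat7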

      2017-09-15

      @@ -480,16 +480,16 @@ DELETE 207
    • Abenet wants to be able to filter by ISI Journal in advanced search on queries like this: https://cgspace.cgiar.org/discover?filtertype_0=dateIssued&filtertype_1=dateIssued&filter_relational_operator_1=equals&filter_relational_operator_0=equals&filter_1=%5B2010+TO+2017%5D&filter_0=2017&filtertype=type&filter_relational_operator=equals&filter=Journal+Article
    • I opened an issue to track this (#340) and will test it on DSpace Test soon
    • Marianne Gadeberg from WLE asked if I would add an account for Adam Hunt on CGSpace and give him permissions to approve all WLE publications
    • -
    • I told him to register first, as he's a CGIAR user and needs an account to be created before I can add him to the groups
    • +
    • I told him to register first, as he’s a CGIAR user and needs an account to be created before I can add him to the groups

    2017-09-20

    • Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite
    • -
    • Force thumbnail regeneration for the CGIAR System Organization's Historic Archive community (2000 items):
    • +
    • Force thumbnail regeneration for the CGIAR System Organization’s Historic Archive community (2000 items):
    $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"
     
      -
    • I'm still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org
    • +
    • I’m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org

    2017-09-21

      @@ -507,29 +507,29 @@ DELETE 207
      • Start investigating other platforms for CGSpace due to linear instance pricing on Linode
      • We need to figure out how much memory is used by applications, caches, etc, and how much disk space the asset store needs
      • -
      • First, here's the last week of memory usage on CGSpace and DSpace Test:
      • +
      • First, here’s the last week of memory usage on CGSpace and DSpace Test:

      CGSpace memory week DSpace Test memory week

        -
      • 8GB of RAM seems to be good for DSpace Test for now, with Tomcat's JVM heap taking 3GB, caches and buffers taking 3–4GB, and then ~1GB unused
      • -
      • 24GB of RAM is way too much for CGSpace, with Tomcat's JVM heap taking 5.5GB and caches and buffers happily using 14GB or so
      • +
      • 8GB of RAM seems to be good for DSpace Test for now, with Tomcat’s JVM heap taking 3GB, caches and buffers taking 3–4GB, and then ~1GB unused
      • +
      • 24GB of RAM is way too much for CGSpace, with Tomcat’s JVM heap taking 5.5GB and caches and buffers happily using 14GB or so
      • As far as disk space, the CGSpace assetstore currently uses 51GB and Solr cores use 86GB (mostly in the statistics core)
      • -
      • DSpace Test currently doesn't even have enough space to store a full copy of CGSpace, as its Linode instance only has 96GB of disk space
      • -
      • I've heard Google Cloud is nice (cheap and performant) but it's definitely more complicated than Linode and instances aren't that much cheaper to make it worth it
      • +
      • DSpace Test currently doesn’t even have enough space to store a full copy of CGSpace, as its Linode instance only has 96GB of disk space
      • +
      • I’ve heard Google Cloud is nice (cheap and performant) but it’s definitely more complicated than Linode and instances aren’t that much cheaper to make it worth it
      • Here are some theoretical instances on Google Cloud:
        • DSpace Test, n1-standard-2 with 2 vCPUs, 7.5GB RAM, 300GB persistent SSD: $99/month
        • CGSpace, n1-standard-4 with 4 vCPUs, 15GB RAM, 300GB persistent SSD: $148/month
      • -
      • Looking at Linode's instance pricing, for DSpace Test it seems we could use the same 8GB instance for $40/month, and then add block storage of ~300GB for $30 (block storage is currently in beta and priced at $0.10/GiB)
      • +
      • Looking at Linode’s instance pricing, for DSpace Test it seems we could use the same 8GB instance for $40/month, and then add block storage of ~300GB for $30 (block storage is currently in beta and priced at $0.10/GiB)
      • For CGSpace we could use the cheaper 12GB instance for $80 and then add block storage of 500GB for $50
      • -
      • I've sent Peter a message about moving DSpace Test to the New Jersey data center so we can test the block storage beta
      • +
      • I’ve sent Peter a message about moving DSpace Test to the New Jersey data center so we can test the block storage beta
      • Create pull request for adding ISI Journal to search filters (#341)
      • Peter asked if we could map all the items of type Journal Article in ILRI Archive to ILRI articles in journals and newsletters
      • It is easy to do via CSV using OpenRefine but I noticed that on CGSpace ~1,000 of the expected 2,500 are already mapped, while on DSpace Test they were not
      • -
      • I've asked Peter if he knows what's going on (or who mapped them)
      • +
      • I’ve asked Peter if he knows what’s going on (or who mapped them)
      • Turns out he had already mapped some, but requested that I finish the rest
      • With this GREL in OpenRefine I can find items that are mapped, ie they have 10568/3|| or 10568/3$ in their collection field:
      @@ -543,7 +543,7 @@ DELETE 207
      • Email Rosemary Kande from ICT to ask about the administrative / finance procedure for moving DSpace Test from EU to US region on Linode
      • Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org
      • -
      • Peter wants me to clean up the text values for Delia Grace's metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:
      • +
      • Peter wants me to clean up the text values for Delia Grace’s metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:
      dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';                                  
         text_value  |              authority               | confidence              
      @@ -554,7 +554,7 @@ DELETE 207
        Grace, D.    | 6a8ddca3-33c1-45f9-aa00-6fa9fc91e3fc |         -1
       
      • Strangely, none of her authority entries have ORCIDs anymore…
      • -
      • I'll just fix the text values and forget about it for now:
      • +
      • I’ll just fix the text values and forget about it for now:
      dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
       UPDATE 610
      @@ -593,24 +593,24 @@ real    6m6.447s
       user    1m34.010s
       sys     0m12.113s
       
        -
      • The index-authority script always seems to fail, I think it's the same old bug
      • -
      • Something interesting for my notes about JNDI database pool—since I couldn't determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:
      • +
      • The index-authority script always seems to fail, I think it’s the same old bug
      • +
      • Something interesting for my notes about JNDI database pool—since I couldn’t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:
      ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
       ...
       INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
       INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
       
        -
      • So it's good to know that something gets printed when it fails because I didn't see any mention of JNDI before when I was testing!
      • +
      • So it’s good to know that something gets printed when it fails because I didn’t see any mention of JNDI before when I was testing!

      2017-09-26

      • Adam Hunt from WLE finally registered so I added him to the editor and approver groups
      • -
      • Then I noticed that Sisay never removed Marianne's user accounts from the approver steps in the workflow because she is already in the WLE groups, which are in those steps
      • -
      • For what it's worth, I had asked him to remove them on 2017-09-14
      • +
      • Then I noticed that Sisay never removed Marianne’s user accounts from the approver steps in the workflow because she is already in the WLE groups, which are in those steps
      • +
      • For what it’s worth, I had asked him to remove them on 2017-09-14
      • I also went and added the WLE approvers and editors groups to the appropriate steps of all the Phase I and Phase II research theme collections
      • -
      • A lot of CIAT's items have manually generated thumbnails which have an incorrect aspect ratio and an ugly black border
      • -
      • I communicated with Elizabeth from CIAT to tell her she should use DSpace's automatically generated thumbnails
      • +
      • A lot of CIAT’s items have manually generated thumbnails which have an incorrect aspect ratio and an ugly black border
      • +
      • I communicated with Elizabeth from CIAT to tell her she should use DSpace’s automatically generated thumbnails
      • Start discussing with ICT about the Linode server update for DSpace Test
      • Rosemary said I need to work with Robert Okal to destroy/create the server, and then let her and Lilian Masigah from finance know the updated Linode asset names for their records
      @@ -618,7 +618,7 @@ INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Da
      • Tunji from the System Organization finally sent the DNS request for library.cgiar.org to CGNET
      • Now the redirects work
      • -
      • I quickly registered a Let's Encrypt certificate for the domain:
      • +
      • I quickly registered a Let’s Encrypt certificate for the domain:
      # systemctl stop nginx
       # /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
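       • After the certificate is issued nginx has to be started again, and the new certificate can be spot-checked from the command line (a sketch):
       # systemctl start nginx
       # echo | openssl s_client -connect library.cgiar.org:443 -servername library.cgiar.org 2>/dev/null | openssl x509 -noout -issuer -enddate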
      diff --git a/docs/2017-10/index.html b/docs/2017-10/index.html
      index 5e9be4429..732b802c2 100644
      --- a/docs/2017-10/index.html
      +++ b/docs/2017-10/index.html
      @@ -12,7 +12,7 @@ Peter emailed to point out that many items in the ILRI archive collection have m
       
       http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
       
      -There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
      +There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
       Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
       " />
       
      @@ -28,10 +28,10 @@ Peter emailed to point out that many items in the ILRI archive collection have m
       
       http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
       
      -There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
      +There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
       Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
       "/>
         

      October, 2017

      @@ -119,7 +119,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     
      -
    • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • +
    • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections

    2017-10-02

    @@ -130,13 +130,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    2017-10-01 20:24:57,928 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
     2017-10-01 20:22:37,982 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
     
      -
    • I thought maybe his account had expired (seeing as it's was the first of the month) but he says he was finally able to log in today
    • +
    • I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today
    • The logs for yesterday show fourteen errors related to LDAP auth failures:
    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
     14
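    • Since the underlying error above was a connection timeout to svcgroot2.cgiarad.org on port 3269, a basic reachability check from the server would be (a sketch):
    $ nc -zv -w5 svcgroot2.cgiarad.org 3269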
     
      -
    • For what it's worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET's LDAP server
    • +
    • For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server
    • Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks

    2017-10-04

    @@ -147,7 +147,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
     
      -
    • We'll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn't exist in Discovery yet, but we'll need to add it) as we have moved their poorly curated subjects from dc.subject to cg.subject.system
    • +
    • We’ll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn’t exist in Discovery yet, but we’ll need to add it) as we have moved their poorly curated subjects from dc.subject to cg.subject.system
    • The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead
    • Help Sisay proof sixty-two IITA records on DSpace Test
    • Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries
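    • Once the rewrites are in place, the browse-link handling above can be spot-checked from the command line with the example URL from this section; it should come back as a 301 pointing at the equivalent cgspace.cgiar.org browse URL (a sketch):
    $ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' 'http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject'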
    • @@ -155,8 +155,8 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

    2017-10-05

      -
    • Twice in the past twenty-four hours Linode has warned that CGSpace's outbound traffic rate was exceeding the notification threshold
    • -
    • I had a look at yesterday's OAI and REST logs in /var/log/nginx but didn't see anything unusual:
    • +
    • Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold
    • +
    • I had a look at yesterday’s OAI and REST logs in /var/log/nginx but didn’t see anything unusual:
    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
         141 157.55.39.240
    @@ -183,7 +183,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
    • Working on the nginx redirects for CGIAR Library
    • We should start using 301 redirects and also allow for /sitemap to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way
    • -
    • Remove eleven occurrences of ACP in IITA's cg.coverage.region using the Atmire batch edit module from Discovery
    • +
    • Remove eleven occurrences of ACP in IITA’s cg.coverage.region using the Atmire batch edit module from Discovery
    • Need to investigate how we can verify the library.cgiar.org using the HTML or DNS methods
    • Run corrections on 143 ILRI Archive items that had two dc.identifier.uri values (Handle) that Peter had pointed out earlier this week
    • I used OpenRefine to isolate them and then fixed and re-imported them into CGSpace
    • @@ -197,7 +197,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

      Original flat thumbnails Tweaked with border and box shadow

        -
      • I'll post it to the Yammer group to see what people think
      • +
      • I’ll post it to the Yammer group to see what people think
      • I figured out at way to do the HTML verification for Google Search console for library.cgiar.org
      • We can drop the HTML file in their XMLUI theme folder and it will get copied to the webapps directory during build/install
      • Then we add an nginx alias for that URL in the library.cgiar.org vhost
      • @@ -213,7 +213,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG Google Search Console 2 Google Search results

          -
        • I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace's console (currently I'm just a user) in order to do that
        • +
        • I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace’s console (currently I’m just a user) in order to do that
        • Manually clean up some communities and collections that Peter had requested a few weeks ago
        • Delete Community 10568/102 (ILRI Research and Development Issues)
        • Move five collections to 10568/27629 (ILRI Projects) using move-collections.sh with the following configuration:
        • @@ -233,8 +233,8 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

        Change of Address error

          -
        • We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won't work—we'll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects
        • -
        • Also the Google Search Console doesn't work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!
        • +
        • We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won’t work—we’ll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects
        • +
        • Also the Google Search Console doesn’t work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!

        2017-10-12

          @@ -245,7 +245,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
          • Run system updates on DSpace Test and reboot server
          • Merge changes adding a search/browse index for CGIAR System subject to 5_x-prod (#344)
          • -
          • I checked the top browse links in Google's search results for site:library.cgiar.org inurl:browse and they are all redirected appropriately by the nginx rewrites I worked on last week
          • +
          • I checked the top browse links in Google’s search results for site:library.cgiar.org inurl:browse and they are all redirected appropriately by the nginx rewrites I worked on last week

          2017-10-22

            @@ -256,12 +256,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

          2017-10-26

            -
          • In the last 24 hours we've gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace
          • +
          • In the last 24 hours we’ve gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace
          • Uptime Robot even noticed CGSpace go “down” for a few minutes
          • In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down due to SQL connection pool errors
          • Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up
          • Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again
          • -
          • Still not sure where the load is coming from right now, but it's clear why there were so many alerts yesterday on the 25th!
          • +
          • Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!
          # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
           18022
          @@ -274,12 +274,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
           7851
           
          • I still have no idea what was causing the load to go up today
          • -
          • I finally investigated Magdalena's issue with the item download stats and now I can't reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats
          • +
          • I finally investigated Magdalena’s issue with the item download stats and now I can’t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats
          • I think it might have been an issue with the statistics not being fresh
          • I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten
          • Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data
          • I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection
          • -
          • We've never used it but it could be worth looking at
          • +
          • We’ve never used it but it could be worth looking at

          2017-10-27

            @@ -292,24 +292,24 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

            2017-10-29

            • Linode alerted about high CPU usage again on CGSpace around 2AM and 4AM
            • -
            • I'm still not sure why this started causing alerts so repeatadely the past week
            • -
            • I don't see any tell tale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:
            • +
            • I’m still not sure why this started causing alerts so repeatedly the past week
            • +
            • I don’t see any telltale signs in the REST or OAI logs, so I’m trying to do rudimentary analysis in the DSpace logs:
            # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
             2049
             
            • So there were 2049 unique sessions during the hour of 2AM
            • Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts
            • -
            • I think I'll need to enable access logging in nginx to figure out what's going on
            • -
            • After enabling logging on requests to XMLUI on / I see some new bot I've never seen before:
            • +
            • I think I’ll need to enable access logging in nginx to figure out what’s going on
            • +
            • After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:
            137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
             
            • CORE seems to be some bot that is “Aggregating the world’s open access research papers”
            • -
            • The contact address listed in their bot's user agent is incorrect, correct page is simply: https://core.ac.uk/contact
            • -
            • I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Valve
            • +
            • The contact address listed in their bot’s user agent is incorrect; the correct page is simply: https://core.ac.uk/contact
            • +
            • I will check the logs in a few days to see if they are harvesting us regularly, then add their bot’s user agent to the Tomcat Crawler Session Valve
            • After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
            • -
            • For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace
            • +
            • For now I will just contact them to have them update their contact info in the bot’s user agent, but eventually I think I’ll tell them to swap out the CGIAR Library entry for CGSpace

            2017-10-30

              @@ -333,7 +333,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG 137.108.70.6 137.108.70.7
      -
    • I will add their user agent to the Tomcat Session Crawler Valve but it won't help much because they are only using two sessions:
    • +
    • I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
     session_id=5771742CABA3D0780860B8DA81E0551B
    @@ -346,7 +346,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     # grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
     24055
     
      -
    • Just because I'm curious who the top IPs are:
    • +
    • Just because I’m curious who the top IPs are:
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
         496 62.210.247.93
    @@ -362,7 +362,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
    • At least we know the top two are CORE, but who are the others?
    • 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
    • -
    • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their session variable, creating thousands of new sessions!
    • +
    • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1419
    @@ -372,7 +372,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
  • From looking at the requests, it appears these are from CIAT and CCAFS
  • I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them
  • Actually, according to the Tomcat docs, we could use an IP with crawlerIps: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
  • -
  • Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn't in Ubuntu 16.04's 7.0.68 build!
  • +
  • Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!
  • That would explain the errors I was getting when trying to set it:
  • WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
    @@ -389,14 +389,14 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     

    2017-10-31

    • Very nice, Linode alerted that CGSpace had high CPU usage at 2AM again
    • -
    • Ask on the dspace-tech mailing list if it's possible to use an existing item as a template for a new item
    • +
    • Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item
    • To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:
    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
      139109 137.108.70.6
      139253 137.108.70.7
     
      -
    • I've emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
    • +
    • I’ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
    • Also, I asked if they could perhaps use the sitemap.xml, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets
    • I added GoAccess to the list of packages to install in the DSpace role of the Ansible infrastructure scripts
    • It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:
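    • A minimal GoAccess run against the nginx access log looks something like this (just a sketch; assumes nginx’s default combined log format):
    # goaccess /var/log/nginx/access.log --log-format=COMBINED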
    • @@ -406,14 +406,14 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    • According to Uptime Robot CGSpace went down and up a few times
    • I had a look at goaccess and I saw that CORE was actively indexing
    • Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)
    • -
    • I'm really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
    • -
    • Actually, come to think of it, they aren't even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
    • +
    • I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
    • +
    • Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
      158058 GET /discover
       14260 GET /search-filter
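    • For reference, the relevant part of our robots.txt is roughly the following (a sketch of the disallow rules in question, not the complete file):
    User-agent: *
    Disallow: /discover
    Disallow: /search-filter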
     
      -
    • I tested a URL of pattern /discover in Google's webmaster tools and it was indeed identified as blocked
    • +
    • I tested a URL of pattern /discover in Google’s webmaster tools and it was indeed identified as blocked
    • I will send feedback to the CORE bot team
    diff --git a/docs/2017-11/index.html b/docs/2017-11/index.html
    index 90438795f..00db8fa24 100644
    --- a/docs/2017-11/index.html
    +++ b/docs/2017-11/index.html
    @@ -45,7 +45,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
    COPY 54701
    "/>
    @@ -75,7 +75,7 @@ COPY 54701
    @@ -122,7 +122,7 @@ COPY 54701

    November, 2017

    @@ -160,15 +160,15 @@ COPY 54701

    2017-11-03

    • Atmire got back to us to say that they estimate it will take two days of labor to implement the change to Listings and Reports
    • -
    • I said I'd ask Abenet if she wants that feature
    • +
    • I said I’d ask Abenet if she wants that feature

    2017-11-04

      -
    • I finished looking through Sisay's CIAT records for the “Alianzas de Aprendizaje” data
    • +
    • I finished looking through Sisay’s CIAT records for the “Alianzas de Aprendizaje” data
    • I corrected about half of the authors to standardize them
    • Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM
    • -
    • It's the first time in a few days that this has happened
    • -
    • I had a look to see what was going on, but it isn't the CORE bot:
    • +
    • It’s the first time in a few days that this has happened
    • +
    • I had a look to see what was going on, but it isn’t the CORE bot:
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
         306 68.180.229.31
    @@ -193,11 +193,11 @@ COPY 54701
     /var/log/nginx/access.log.5.gz:0
     /var/log/nginx/access.log.6.gz:0
     
      -
    • It's clearly a bot as it's making tens of thousands of requests, but it's using a “normal” user agent:
    • +
    • It’s clearly a bot as it’s making tens of thousands of requests, but it’s using a “normal” user agent:
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
     
      -
    • For now I don't know what this user is!
    • +
    • For now I don’t know what this user is!

    2017-11-05

    @@ -222,8 +222,8 @@ COPY 54701
     International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 500
    (8 rows)
      -
    • So I'm not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval
    • -
    • Looking at monitoring Tomcat's JVM heap with Prometheus, it looks like we need to use JMX + jmx_exporter
    • +
    • So I’m not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval
    • +
    • Looking at monitoring Tomcat’s JVM heap with Prometheus, it looks like we need to use JMX + jmx_exporter
    • This guide shows how to enable JMX in Tomcat by modifying CATALINA_OPTS
    • I was able to successfully connect to my local Tomcat with jconsole!
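    • For posterity, the CATALINA_OPTS additions for a local, unauthenticated JMX listener look something like this (port 9010 is arbitrary and this is only a sketch for local testing, not a hardened setup):
    CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote"
    CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.port=9010"
    CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
    CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote.ssl=false"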
    @@ -268,8 +268,8 @@ $ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{3 $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l 7051
      -
    • The worst thing is that this user never specifies a user agent string so we can't lump it in with the other bots using the Tomcat Session Crawler Manager Valve
    • -
    • They don't request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with //handle, note the regex below):
    • +
    • The worst thing is that this user never specifies a user agent string so we can’t lump it in with the other bots using the Tomcat Session Crawler Manager Valve
    • +
    • They don’t request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with //handle, note the regex below):
    # grep -c 104.196.152.243 /var/log/nginx/access.log.1
     4681
    @@ -277,7 +277,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     4618
     
    • I just realized that ciat.cgiar.org points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior
    • -
    • The next IP (207.46.13.36) seem to be Microsoft's bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:
    • +
    • The next IP (207.46.13.36) seems to be Microsoft’s bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:
    $ grep -c 207.46.13.36 /var/log/nginx/access.log.1 
     2034
    @@ -328,18 +328,18 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
  • -
  • I'll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
  • -
  • While it's not in the top ten, Baidu is one bot that seems to not give a fuck:
  • +
  • I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
  • +
  • While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:
  • # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
     8912
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
     2521
     
      -
    • According to their documentation their bot respects robots.txt, but I don't see this being the case
    • +
    • According to their documentation their bot respects robots.txt, but I don’t see this being the case
    • I think I will end up blocking Baidu as well…
    • Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed
    • -
    • I should look in nginx access.log, rest.log, oai.log, and DSpace's dspace.log.2017-11-07
    • +
    • I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07
    • Here are the top IPs making requests to XMLUI from 2 to 8 AM:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    @@ -389,8 +389,8 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
         462 ip_addr=104.196.152.243
         488 ip_addr=66.249.66.90
     
      -
    • These aren't actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers
    • -
    • The number of requests isn't even that high to be honest
    • +
    • These aren’t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers
    • +
    • The number of requests isn’t even that high to be honest
    • As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:
    # zgrep -c 124.17.34.59 /var/log/nginx/access.log*
    @@ -405,13 +405,13 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     /var/log/nginx/access.log.8.gz:0
     /var/log/nginx/access.log.9.gz:1
     
      -
    • The whois data shows the IP is from China, but the user agent doesn't really give any clues:
    • +
    • The whois data shows the IP is from China, but the user agent doesn’t really give any clues:
    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
         210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
       22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
     
      -
    • A Google search for “LCTE bot” doesn't return anything interesting, but this Stack Overflow discussion references the lack of information
    • +
    • A Google search for “LCTE bot” doesn’t return anything interesting, but this Stack Overflow discussion references the lack of information
    • So basically after a few hours of looking at the log files I am not closer to understanding what is going on!
    • I do know that we want to block Baidu, though, as it does not respect robots.txt
    • And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12–14 hours)
    • @@ -479,13 +479,13 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
      $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
       20733
       
        -
      • I'm getting really sick of this
      • +
      • I’m getting really sick of this
      • Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
      • I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
      • Run system updates on DSpace Test and reboot the server
      • Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (#346)
      • -
      • I figured out a way to use nginx's map function to assign a “bot” user agent to misbehaving clients who don't define a user agent
      • -
      • Most bots are automatically lumped into one generic session by Tomcat's Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*
      • +
      • I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent
      • +
      • Most bots are automatically lumped into one generic session by Tomcat’s Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*
      • Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
      • Basically, we modify the nginx config to add a mapping with a modified user agent $ua:
    @@ -495,15 +495,15 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
        default $http_user_agent;
    }
      -
    • If the client's address matches then the user agent is set, otherwise the default $http_user_agent variable is used
    • -
    • Then, in the server's / block we pass this header to Tomcat:
    • +
    • If the client’s address matches then the user agent is set, otherwise the default $http_user_agent variable is used
    • +
    • Then, in the server’s / block we pass this header to Tomcat:
    proxy_pass http://tomcat_http;
     proxy_set_header User-Agent $ua;
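    • Pieced together, the map block looks roughly like this (a sketch; the replacement agent string is illustrative, chosen to contain “bot” so the Crawler Session Manager Valve regex matches it):
    map $remote_addr $ua {
        # treat this known misbehaving client as a bot
        104.196.152.243    'Misbehaving harvester bot';
        default            $http_user_agent;
    }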
     
      -
    • Note to self: the $ua variable won't show up in nginx access logs because the default combined log format doesn't show it, so don't run around pulling your hair out wondering with the modified user agents aren't showing in the logs!
    • +
    • Note to self: the $ua variable won’t show up in nginx access logs because the default combined log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!
    • If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
    • -
    • You can verify by cross referencing nginx's access.log and DSpace's dspace.log.2017-11-08, for example
    • +
    • You can verify by cross referencing nginx’s access.log and DSpace’s dspace.log.2017-11-08, for example
    • I will deploy this on CGSpace later this week
    • I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on 2017-11-07 for example)
    • I merged the clickable thumbnails code to 5_x-prod (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible nginx and tomcat tags)
    • @@ -522,7 +522,7 @@ proxy_set_header User-Agent $ua; 1134
    • I have been looking for a reason to ban Baidu and this is definitely a good one
    • -
    • Disallowing Baiduspider in robots.txt probably won't work because this bot doesn't seem to respect the robot exclusion standard anyways!
    • +
    • Disallowing Baiduspider in robots.txt probably won’t work because this bot doesn’t seem to respect the robot exclusion standard anyways!
    • I will whip up something in nginx later
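    • One candidate (a sketch, not necessarily what will end up in production) is to simply return HTTP 403 to anything identifying as Baiduspider in the server block:
    if ($http_user_agent ~* "baiduspider") {
        return 403;
    }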
    • Run system updates on CGSpace and reboot the server
    • Re-deploy latest 5_x-prod branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)
    • @@ -548,7 +548,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3 3506
    • The number of sessions is over ten times less!
    • -
    • This gets me thinking, I wonder if I can use something like nginx's rate limiter to automatically change the user agent of clients who make too many requests
    • +
    • This gets me thinking, I wonder if I can use something like nginx’s rate limiter to automatically change the user agent of clients who make too many requests
    • Perhaps using a combination of geo and map, like illustrated here: https://www.nginx.com/blog/rate-limiting-nginx/
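    • The core building block from that article is a limit_req_zone keyed on the client address, roughly like this (zone name, rate, and burst values are made up for illustration):
    limit_req_zone $binary_remote_addr zone=dynamicpages:10m rate=5r/s;

    location /discover {
        limit_req zone=dynamicpages burst=10 nodelay;
        proxy_pass http://tomcat_http;
    }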

    2017-11-11

    @@ -560,7 +560,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3

    2017-11-12

    • Update the Ansible infrastructure templates to be a little more modular and flexible
    • -
    • Looking at the top client IPs on CGSpace so far this morning, even though it's only been eight hours:
    • +
    • Looking at the top client IPs on CGSpace so far this morning, even though it’s only been eight hours:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         243 5.83.120.111
    @@ -579,7 +579,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
     
    # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
     5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
     
      -
    • What's amazing is that it seems to reuse its Java session across all requests:
    • +
    • What’s amazing is that it seems to reuse its Java session across all requests:
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
     1558
    @@ -587,7 +587,7 @@ $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | s
     1
     
    • Bravo to MegaIndex.ru!
    • -
    • The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat's Crawler Session Manager valve regex should match ‘YandexBot’:
    • +
    • The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:
    # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
     95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    @@ -600,8 +600,8 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
     10947/34   10947/1 10568/83389
     10947/2512 10947/1 10568/83389
     
    @@ -664,7 +664,7 @@ Server: nginx
    • Deploy some nginx configuration updates to CGSpace
    • They had been waiting on a branch for a few months and I think I just forgot about them
    • -
    • I have been running them on DSpace Test for a few days and haven't seen any issues there
    • +
    • I have been running them on DSpace Test for a few days and haven’t seen any issues there
    • Started testing DSpace 6.2 and a few things have changed
    • Now PostgreSQL needs pgcrypto:
    @@ -672,21 +672,21 @@ Server: nginx dspace6=# CREATE EXTENSION pgcrypto;
    • Also, local settings are no longer in build.properties, they are now in local.cfg
    • -
    • I'm not sure if we can use separate profiles like we did before with mvn -Denv=blah to use blah.properties
    • +
    • I’m not sure if we can use separate profiles like we did before with mvn -Denv=blah to use blah.properties
    • It seems we need to use “system properties” to override settings, ie: -Ddspace.dir=/Users/aorth/dspace6

    2017-11-15

    • Send Adam Hunt an invite to the DSpace Developers network on Yammer
    • He is the new head of communications at WLE, since Michael left
    • -
    • Merge changes to item view's wording of link metadata (#348)
    • +
    • Merge changes to item view’s wording of link metadata (#348)

    2017-11-17

    • Uptime Robot said that CGSpace went down today and I see lots of Timeout waiting for idle object errors in the DSpace logs
    • I looked in PostgreSQL using SELECT * FROM pg_stat_activity; and saw that there were 73 active connections
    • After a few minutes the connections went down to 44 and CGSpace was kinda back up; it seems like Tsega restarted Tomcat
    • -
    • Looking at the REST and XMLUI log files, I don't see anything too crazy:
    • +
    • Looking at the REST and XMLUI log files, I don’t see anything too crazy:
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          13 66.249.66.223
    @@ -712,7 +712,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
        2020 66.249.66.219
     
    • I need to look into using JMX to analyze active sessions I think, rather than looking at log files
    • -
    • After adding appropriate JMX listener options to Tomcat's JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:
    • +
    • After adding appropriate JMX listener options to Tomcat’s JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:
    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
     
      @@ -760,14 +760,14 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
      2017-11-19 03:00:32,806 INFO  org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
       2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
       
        -
      • It's been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:
      • +
      • It’s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:

      Tomcat G1GC

      2017-11-20

    • I found an article about JVM tuning that gives some pointers on how to enable logging, and tools to analyze the logs for you
      • Also notes on rotating GC logs
      • -
      • I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven't even tried to monitor or tune it
      • +
      • I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven’t even tried to monitor or tune it

      2017-11-21

        @@ -777,7 +777,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19

    2017-11-22

    • Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM
    • -
    • The logs don't show anything particularly abnormal between those hours:
    • +
    • The logs don’t show anything particularly abnormal between those hours:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         136 31.6.77.23
    @@ -791,7 +791,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
         696 66.249.66.90
         707 104.196.152.243
     
      -
    • I haven't seen 54.144.57.183 before, it is apparently the CCBot from commoncrawl.org
    • +
    • I haven’t seen 54.144.57.183 before; it is apparently the CCBot from commoncrawl.org
    • In other news, it looks like the JVM garbage collection pattern is back to its standard jigsaw pattern after switching back to CMS a few days ago:

    Tomcat JVM with CMS GC

    @@ -826,22 +826,22 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19 942 45.5.184.196 3995 70.32.83.92
    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
     

    2017-11-24

    PostgreSQL connections after tweak (week)

    PostgreSQL connections after tweak (month)

    @@ -893,29 +893,29 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19 6053 45.5.184.196
    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     10037
     
    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     12377
     $ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     16984
     

    2017-11-30

    diff --git a/docs/2017-12/index.html b/docs/2017-12/index.html
    index 87089c0d1..9931eb6a7 100644
    --- a/docs/2017-12/index.html
    +++ b/docs/2017-12/index.html
    @@ -27,7 +27,7 @@ The logs say “Timeout waiting for idle object”
    PostgreSQL activity says there are 115 connections currently
    The list of connections to XMLUI and REST API for today:
    "/>
    @@ -57,7 +57,7 @@ The list of connections to XMLUI and REST API for today:
    @@ -104,7 +104,7 @@ The list of connections to XMLUI and REST API for today:

    December, 2017

    @@ -128,7 +128,7 @@ The list of connections to XMLUI and REST API for today: 4007 70.32.83.92 6061 45.5.184.196
    $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     5815
    @@ -148,7 +148,7 @@ The list of connections to XMLUI and REST API for today:
         314 2.86.122.76
     
    $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     822
    @@ -169,20 +169,20 @@ The list of connections to XMLUI and REST API for today:
         319 2001:4b99:1:1:216:3eff:fe76:205b
     

    2017-12-03

    2017-12-04

    DSpace Test PostgreSQL connections month

    CGSpace PostgreSQL connections month

    2017-12-05

    @@ -196,8 +196,8 @@ The list of connections to XMLUI and REST API for today:
  • Linode alerted again that the CPU usage on CGSpace was high this morning from 6 to 8 AM
  • Uptime Robot alerted that the server went down and up around 8:53 this morning
  • Uptime Robot alerted that CGSpace was down and up again a few minutes later
  • -
  • I don't see any errors in the DSpace logs but I see in nginx's access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
  • -
  • Looking at the REST API logs I see some new client IP I haven't noticed before:
  • +
  • I don’t see any errors in the DSpace logs but I see in nginx’s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
  • +
  • Looking at the REST API logs I see some new client IP I haven’t noticed before:
  • # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          18 95.108.181.88
    @@ -233,7 +233,7 @@ The list of connections to XMLUI and REST API for today:
        2662 66.249.66.219
        5110 124.17.34.60
     
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
    @@ -243,7 +243,7 @@ The list of connections to XMLUI and REST API for today:
     
    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     4574
     
      -
    • I've adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it's the same bot on the same subnet
    • +
    • I’ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it’s the same bot on the same subnet
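    • In an nginx map a key prefixed with ~ is treated as a regular expression, so the adjusted entry is something like this (illustrative values, not the exact production config):
    map $remote_addr $ua {
        "~^124\.17\.34\."    'LCTE harvester bot';
        default              $http_user_agent;
    }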
    • I was running the DSpace cleanup task manually and it hit an error:
    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v
    @@ -261,7 +261,7 @@ UPDATE 1
     
     

    2017-12-16

      -
    • Re-work the XMLUI base theme to allow child themes to override the header logo's image and link destination: #349
    • +
    • Re-work the XMLUI base theme to allow child themes to override the header logo’s image and link destination: #349
    • This required a little bit of work to restructure the XSL templates
    • Optimize PNG and SVG image assets in the CGIAR base theme using pngquant and svgo: #350
    @@ -276,7 +276,7 @@ UPDATE 1
  • I also had to add the .jpg to the thumbnail string in the CSV
  • The thumbnail11.jpg is missing
  • The dates are in super long ISO8601 format (from Excel?) like 2016-02-07T00:00:00Z so I converted them to simpler forms in GREL: value.toString("yyyy-MM-dd")
  • -
  • I trimmed the whitespaces in a few fields but it wasn't many
  • +
  • I trimmed the whitespaces in a few fields but it wasn’t many
  • Rename her thumbnail column to filename, and format it so SAFBuilder adds the files to the thumbnail bundle with this GREL in OpenRefine: value + "__bundle:THUMBNAIL"
  • Rename dc.identifier.status and dc.identifier.url columns to cg.identifier.status and cg.identifier.url
  • Item 4 has weird characters in citation, ie: Nagoya et de Trait
  • @@ -289,7 +289,7 @@ UPDATE 1
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
     
      -
    • It's the same on DSpace Test, I can't import the SAF bundle without specifying the collection:
    • +
    • It’s the same on DSpace Test, I can’t import the SAF bundle without specifying the collection:
    $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
     No collections given. Assuming 'collections' file inside item directory
    @@ -317,7 +317,7 @@ Elapsed time: 2 secs (2559 msecs)
     
    -Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
     
    • … but the error message was the same, just with more INFO noise around it
    • -
    • For now I'll import into a collection in DSpace Test but I'm really not sure what's up with this!
    • +
    • For now I’ll import into a collection in DSpace Test but I’m really not sure what’s up with this!
    • Linode alerted that CGSpace was using high CPU from 4 to 6 PM
    • The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:
    @@ -347,7 +347,7 @@ Elapsed time: 2 secs (2559 msecs) 4014 70.32.83.92 11030 45.5.184.196
      -
    • That's probably ok, as I don't think the REST API connections use up a Tomcat session…
    • +
    • That’s probably ok, as I don’t think the REST API connections use up a Tomcat session…
    • CIP emailed a few days ago to ask about unique IDs for authors and organizations, and if we can provide them via an API
    • Regarding the import issue above it seems to be a known issue that has a patch in DSpace 5.7:
    • -
    • We're on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731
    • +
    • We’re on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731
    • I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase on to the latest 5.x upstream later
    • Pull request: #351
    @@ -393,7 +393,7 @@ Elapsed time: 2 secs (2559 msecs)
  • I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
  • Update text on CGSpace about page to give some tips to developers about using the resources more wisely (#352)
  • Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM
  • -
  • The REST and OAI API logs look pretty much the same as earlier this morning, but there's a new IP harvesting XMLUI:
  • +
  • The REST and OAI API logs look pretty much the same as earlier this morning, but there’s a new IP harvesting XMLUI:
  • # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
         360 95.108.181.88
    @@ -416,8 +416,8 @@ Elapsed time: 2 secs (2559 msecs)
     
    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
     1
     
      -
    • I guess there's nothing I can do to them for now
    • -
    • In other news, I am curious how many PostgreSQL connection pool errors we've had in the last month:
    • +
    • I guess there’s nothing I can do to them for now
    • +
    • In other news, I am curious how many PostgreSQL connection pool errors we’ve had in the last month:
    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
     dspace.log.2017-11-07:15695
    @@ -430,9 +430,9 @@ dspace.log.2017-12-01:1601
     dspace.log.2017-12-02:1274
     dspace.log.2017-12-07:2769
     
      -
    • I made a small fix to my move-collections.sh script so that it handles the case when a “to” or “from” community doesn't exist
    • +
    • I made a small fix to my move-collections.sh script so that it handles the case when a “to” or “from” community doesn’t exist
    • The script lives here: https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515
    • -
    • Major reorganization of four of CTA's French collections
    • +
    • Major reorganization of four of CTA’s French collections
    • Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities
    • Move collection 10568/51821 from 10568/42212 to 10568/42211
    • Move collection 10568/51400 from 10568/42214 to 10568/42211
    • @@ -457,21 +457,21 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery

      2017-12-19

      • Briefly had PostgreSQL connection issues on CGSpace for the millionth time
      • -
      • I'm fucking sick of this!
      • +
      • I’m fucking sick of this!
      • The connection graph on CGSpace shows shit tons of connections idle

      Idle PostgreSQL connections on CGSpace

        -
      • And I only now just realized that DSpace's db.maxidle parameter is not seconds, but number of idle connections to allow.
      • +
      • And I only now just realized that DSpace’s db.maxidle parameter is not seconds, but number of idle connections to allow.
      • So theoretically, because each webapp has its own pool, this could be 20 per app—so no wonder we have 50 idle connections!
      • I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: https://jira.duraspace.org/browse/DS-3564
      • -
      • So I'm going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI
      • +
      • So I’m going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI
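    • In dspace.cfg that is just a one-line change (remembering that the value applies per webapp):
    # maximum number of idle connections to keep in the pool, per webapp
    db.maxidle = 10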
      • I re-deployed the 5_x-prod branch on CGSpace, applied all system updates, and restarted the server
      • Looking through the dspace.log I see this error:
      2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
       
        -
      • I don't have time now to look into this but the Solr sharding has long been an issue!
      • +
      • I don’t have time now to look into this but the Solr sharding has long been an issue!
      • Looking into using JDBC / JNDI to provide a database pool to DSpace
      • The DSpace 6.x configuration docs have more notes about setting up the database pool than the 5.x ones (which actually have none!)
      • First, I uncomment db.jndi in dspace/config/dspace.cfg
      • @@ -496,7 +496,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
        <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
         
        • I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc…
        • -
        • When DSpace can't find the JNDI context (for whatever reason) you will see this in the dspace logs:
        • +
        • When DSpace can’t find the JNDI context (for whatever reason) you will see this in the dspace logs:
        2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
         javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
        @@ -547,31 +547,31 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
            <version>9.1-901-1.jdbc4</version>
         </dependency>
         
          -
        • So WTF? Let's try copying one to Tomcat's lib folder and restarting Tomcat:
        • +
        • So WTF? Let’s try copying one to Tomcat’s lib folder and restarting Tomcat:
        $ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
         
          -
        • Oh that's fantastic, now at least Tomcat doesn't print an error during startup so I guess it succeeds to create the JNDI pool
        • -
        • DSpace starts up but I have no idea if it's using the JNDI configuration because I see this in the logs:
        • +
        • Oh that’s fantastic, now at least Tomcat doesn’t print an error during startup so I guess it succeeds to create the JNDI pool
        • +
        • DSpace starts up but I have no idea if it’s using the JNDI configuration because I see this in the logs:
        2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
         2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
         2017-12-19 13:26:54,293 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
         2017-12-19 13:26:54,306 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
         
          -
        • Let's try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts…
        • -
        • Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat's lib folder totally blows
        • -
        • Also, it's likely this is only a problem on my local macOS + Tomcat test environment
        • -
        • Ubuntu's Tomcat distribution will probably handle this differently
        • +
        • Let’s try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts…
        • +
        • Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat’s lib folder totally blows
        • +
        • Also, it’s likely this is only a problem on my local macOS + Tomcat test environment
        • +
        • Ubuntu’s Tomcat distribution will probably handle this differently
        • So for reference I have:
          • a <Resource> defined globally in server.xml
          • -
          • a <ResourceLink> defined in each web application's context XML
          • +
          • a <ResourceLink> defined in each web application’s context XML
          • unset the db.url, db.username, and db.password parameters in dspace.cfg
          • set the db.jndi in dspace.cfg to the name specified in the web application context
        • -
        • After adding the Resource to server.xml on Ubuntu I get this in Catalina's logs:
        • +
        • After adding the Resource to server.xml on Ubuntu I get this in Catalina’s logs:
        SEVERE: Unable to create initial connections of pool.
         java.sql.SQLException: org.postgresql.Driver
        @@ -579,8 +579,8 @@ java.sql.SQLException: org.postgresql.Driver
         Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
         
        • The username and password are correct, but maybe I need to copy the fucking lib there too?
        • -
        • I tried installing Ubuntu's libpostgresql-jdbc-java package but Tomcat still can't find the class
        • -
        • Let me try to symlink the lib into Tomcat's libs:
        • +
        • I tried installing Ubuntu’s libpostgresql-jdbc-java package but Tomcat still can’t find the class
        • +
        • Let me try to symlink the lib into Tomcat’s libs:
        # ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
         
          @@ -589,17 +589,17 @@ Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
          SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
           java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
           
            -
          • Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace's are 9.1…
          • -
          • Let me try to remove it and copy in DSpace's:
          • +
          • Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace’s are 9.1…
          • +
          • Let me try to remove it and copy in DSpace’s:
          # rm /usr/share/tomcat7/lib/postgresql.jar
           # cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
           
          • Wow, I think that actually works…
          • I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: https://jdbc.postgresql.org/
          • -
          • I notice our version is 9.1-901, which isn't even available anymore! The latest in the archived versions is 9.1-903
          • +
          • I notice our version is 9.1-901, which isn’t even available anymore! The latest in the archived versions is 9.1-903
    • Also, since I commented out all the db parameters in dspace.cfg, how does the command line dspace tool work?
          • -
          • Let's try the upstream JDBC driver first:
          • +
          • Let’s try the upstream JDBC driver first:
          # rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
           # wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
          @@ -648,8 +648,8 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
           
          • If I add the db values back to dspace.cfg the dspace database info command succeeds but the log still shows errors retrieving the JNDI connection
          • Perhaps something to report to the dspace-tech mailing list when I finally send my comments
          • -
          • Oh cool! select * from pg_stat_activity shows “PostgreSQL JDBC Driver” for the application name! That's how you know it's working!
          • -
          • If you monitor the pg_stat_activity while you run dspace database info you can see that it doesn't use the JNDI and creates ~9 extra PostgreSQL connections!
          • +
          • Oh cool! select * from pg_stat_activity shows “PostgreSQL JDBC Driver” for the application name! That’s how you know it’s working!
          • +
          • If you monitor the pg_stat_activity while you run dspace database info you can see that it doesn’t use the JNDI and creates ~9 extra PostgreSQL connections!
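    • A handy query for watching this, grouping connections by the application name they report (assumes you can psql into the database as the dspace user):
    SELECT application_name, count(*) AS count FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC;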
          • And in the middle of all of this Linode sends an alert that CGSpace has high CPU usage from 2 to 4 PM

          2017-12-20

          @@ -678,14 +678,14 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287

          2017-12-24

          • Linode alerted that CGSpace was using high CPU this morning around 6 AM
          • -
          • I'm playing with reading all of a month's nginx logs into goaccess:
          • +
          • I’m playing with reading all of a month’s nginx logs into goaccess:
          # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
           
          • I can see interesting things using this approach, for example:
              -
            • 50.116.102.77 checked our status almost 40,000 times so far this month—I think it's the CGNet uptime tool
            • -
            • Also, we've handled 2.9 million requests this month from 172,000 unique IP addresses!
            • +
            • 50.116.102.77 checked our status almost 40,000 times so far this month—I think it’s the CGNet uptime tool
            • +
            • Also, we’ve handled 2.9 million requests this month from 172,000 unique IP addresses!
            • Total bandwidth so far this month is 640GiB
            • The user that made the most requests so far this month is 45.5.184.196 (267,000 requests)
    @@ -720,13 +720,13 @@ UPDATE 5
    # delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
    DELETE 20
      -
    • I need to figure out why we have records with language in because that's not a language!
    • +
    • I need to figure out why we have records with language “in” because that’s not a language!

    2017-12-30

    • Linode alerted that CGSpace was using 259% CPU from 4 to 6 AM
    • Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM
    • -
    • Here's the XMLUI logs:
    • +
    • Here’s the XMLUI logs:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         637 207.46.13.106
    @@ -740,14 +740,14 @@ DELETE 20
        1586 66.249.64.78
        3653 66.249.64.91
     
      -
    • Looks pretty normal actually, but I don't know who 54.175.208.220 is
    • +
    • Looks pretty normal actually, but I don’t know who 54.175.208.220 is
    • They identify as “com.plumanalytics”, which Google says is associated with Elsevier
    • -
    • They only seem to have used one Tomcat session so that's good, I guess I don't need to add them to the Tomcat Crawler Session Manager valve:
    • +
    • They only seem to have used one Tomcat session so that’s good, I guess I don’t need to add them to the Tomcat Crawler Session Manager valve:
    $ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l          
     1 
     
      -
    • 216.244.66.245 seems to be moz.com's DotBot
    • +
    • 216.244.66.245 seems to be moz.com’s DotBot

    2017-12-31

    diff --git a/docs/2018-01/index.html b/docs/2018-01/index.html
    index bf99a7005..6fcdbd4d6 100644
    --- a/docs/2018-01/index.html
    +++ b/docs/2018-01/index.html
    @@ -9,7 +9,7 @@
    @@ -83,7 +83,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
    @@ -177,7 +177,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
    @@ -224,7 +224,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv

      January, 2018

      @@ -232,7 +232,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv

      2018-01-02

      • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
      • -
      • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
      • +
      • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
      • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
      • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
      • And just before that I see this:
      • @@ -240,8 +240,8 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
        Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
         
        • Ah hah! So the pool was actually empty!
        • -
        • I need to increase that, let's try to bump it up from 50 to 75
        • -
        • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
        • +
        • I need to increase that, let’s try to bump it up from 50 to 75
        • +
        • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
        • I notice this error quite a few times in dspace.log:
        2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
        @@ -294,7 +294,7 @@ dspace.log.2017-12-31:53
         dspace.log.2018-01-01:45
         dspace.log.2018-01-02:34
         
          -
        • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
        • +
        • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains

        2018-01-03

          @@ -326,8 +326,8 @@ dspace.log.2018-01-03:1909
    • 134.155.96.78 appears to be at the University of Mannheim in Germany
    • They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://ifm.uni-mannheim.de)
    • -
    • This appears to be the Internet Archive's open source bot
    • -
    • They seem to be re-using their Tomcat session so I don't need to do anything to them just yet:
    • +
    • This appears to be the Internet Archive’s open source bot
    • +
    • They seem to be re-using their Tomcat session so I don’t need to do anything to them just yet:
    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
    @@ -387,8 +387,8 @@ dspace.log.2018-01-03:1909
         139 164.39.7.62
     
    • I have no idea what these are but they seem to be coming from Amazon…
    • -
    • I guess for now I just have to increase the database connection pool's max active
    • -
    • It's currently 75 and normally I'd just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling
    • +
    • I guess for now I just have to increase the database connection pool’s max active
    • +
    • It’s currently 75 and normally I’d just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling

    2018-01-04

      @@ -420,14 +420,14 @@ dspace.log.2018-01-02:1972 dspace.log.2018-01-03:1909 dspace.log.2018-01-04:1559

    2018-01-05

    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
    @@ -442,8 +442,8 @@ dspace.log.2018-01-05:0
     
    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
     
    • I will delete the log file for now and tell Danny
    • -
    • Also, I'm still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is
    • -
    • I will run a full Discovery reindex in the mean time to see if it's something wrong with the Discovery Solr core
    • +
    • Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is
    • +
    • I will run a full Discovery reindex in the meantime to see if it’s something wrong with the Discovery Solr core
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
    @@ -456,7 +456,7 @@ sys     3m14.890s
     
     

    2018-01-06

      -
    • I'm still seeing Solr errors in the DSpace logs even after the full reindex yesterday:
    • +
    • I’m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:
    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
     
      @@ -471,7 +471,7 @@ sys 3m14.890s COPY 4515

    2018-01-10

      -
    • I looked to see what happened to this year's Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:
    • +
    • I looked to see what happened to this year’s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:
    Moving: 81742 into core statistics-2010
     Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
    @@ -542,9 +542,9 @@ Caused by: org.apache.http.client.ClientProtocolException
             ... 10 more
     
    • There is interesting documentation about this on the DSpace Wiki: https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear
    • -
    • I'm looking to see maybe if we're hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2
    • +
    • I’m looking to see maybe if we’re hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2
    • I can apparently search for records in the Solr stats core that have an empty owningColl field using this in the Solr admin query: -owningColl:*
    • -
    • On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don't:
    • +
    • On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don’t:
    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
       "response":{"numFound":48476327,"start":0,"docs":[
    @@ -552,14 +552,14 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
       "response":{"numFound":34879872,"start":0,"docs":[
     
    • I tested the dspace stats-util -s process on my local machine and it failed the same way
    • -
    • It doesn't seem to be helpful, but the dspace log shows this:
    • +
    • It doesn’t seem to be helpful, but the dspace log shows this:
    2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
     2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
     
    • Terry Brady has written some notes on the DSpace Wiki about Solr sharding issues: https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues
    • Uptime Robot said that CGSpace went down at around 9:43 AM
    • -
    • I looked at PostgreSQL's pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:
    • +
    • I looked at PostgreSQL’s pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
     0
    @@ -583,7 +583,7 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
     
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
     
    • whois says they come from Perfect IP
    • -
    • I've never seen those top IPs before, but they have created 50,000 Tomcat sessions today:
    • +
    • I’ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:
    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                                                                                                                                  
     49096
    @@ -599,20 +599,20 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
       23401 2607:fa98:40:9:26b6:fdff:feff:195d 
       47875 2607:fa98:40:9:26b6:fdff:feff:1888
     
      -
    • I added the user agent to nginx's badbots limit req zone but upon testing the config I got an error:
    • +
    • I added the user agent to nginx’s badbots limit req zone but upon testing the config I got an error:
    # nginx -t
     nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
     nginx: configuration file /etc/nginx/nginx.conf test failed
     
    # cat /proc/cpuinfo | grep cache_alignment | head -n1
     cache_alignment : 64
     
    • On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx
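    • The directive itself goes in the http block and is a one-liner:
    map_hash_bucket_size 128;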
    • Almost immediately the PostgreSQL connections dropped back down to 40 or so, and UptimeRobot said the site was back up
    • -
    • So that's interesting that we're not out of PostgreSQL connections (current pool maxActive is 300!) but the system is “down” to UptimeRobot and very slow to use
    • +
    • So that’s interesting that we’re not out of PostgreSQL connections (current pool maxActive is 300!) but the system is “down” to UptimeRobot and very slow to use
    • Linode continues to test mitigations for Meltdown and Spectre: https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/
    • I rebooted DSpace Test to see if the kernel will be updated (currently Linux 4.14.12-x86_64-linode92)… nope.
    • It looks like Linode will reboot the KVM hosts later this week, though
    • @@ -650,7 +650,7 @@ cache_alignment : 64 111535 2607:fa98:40:9:26b6:fdff:feff:1c96 161797 2607:fa98:40:9:26b6:fdff:feff:1888
    • Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat’s server.xml:
    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
               driverClassName="org.postgresql.Driver"
    @@ -665,9 +665,9 @@ cache_alignment : 64
               validationQuery='SELECT 1'
               testOnBorrow='true' />
     
    • So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL’s pg_stat_activity table!
    • This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)
• Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your application’s context—not the global one
    • Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:
    db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
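• Verifying it should be as simple as grouping pg_stat_activity by application name (a sketch against the dspacetest database from the JDBC URL above):
$ psql -c 'SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;' dspacetest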
    @@ -676,7 +676,7 @@ cache_alignment : 64
     
     

    2018-01-12

    • I’m looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:
    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
     <Connector port="8080"
    @@ -691,8 +691,8 @@ cache_alignment : 64
                URIEncoding="UTF-8"/>
     
    • In Tomcat 8.5 the maxThreads defaults to 200 which is probably fine, but tweaking minSpareThreads could be good
• I don’t see a setting for maxSpareThreads in the docs so that might be an error
    • Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don’t need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
    • Ooh, I just saw the acceptorThreadCount setting (in Tomcat 7 and 8.5):
    The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
    @@ -707,7 +707,7 @@ cache_alignment : 64
     
    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
     13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
     
• I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this is a Tomcat 8.0.x or 8.5.x thing
    • DBCP2 appears to be Tomcat 8.0.x and up according to the Tomcat 8.0 migration guide
    • I have updated our Ansible infrastructure scripts so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)
    • When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:
    • @@ -735,24 +735,24 @@ Caused by: java.lang.NullPointerException ... 15 more
    • Interesting blog post benchmarking Tomcat JDBC vs Apache Commons DBCP2, with configuration snippets: http://www.tugay.biz/2016/07/tomcat-connection-pool-vs-apache.html
• The Tomcat vs Apache pool thing is confusing, but apparently we’re using Apache Commons DBCP2 because we don’t specify factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in our global resource
• So at least I know that I’m not looking for documentation or troubleshooting on the Tomcat JDBC pool!
    • I looked at pg_stat_activity during Tomcat’s startup and I see that the pool created in server.xml is indeed connecting, just that nothing uses it
    • Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
    • Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
    • I’ll comment on that issue

    2018-01-14

    • Looking at the authors Peter had corrected
    • Some had multiple and he’s corrected them by adding || in the correction column, but I can’t process those this way so I will just have to flag them and do those manually later
    • Also, I can flag the values that have “DELETE”
    • Then I need to facet the correction column on isBlank(value) and not flagged

    2018-01-15

    • Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
    • I’m going to apply these ~130 corrections on CGSpace:
    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
    @@ -764,7 +764,7 @@ update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_f
     update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
     
    • Continue proofing Peter’s author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names

    OpenRefine Authors

      @@ -817,9 +817,9 @@ COPY 4552
    • Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
    • For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
    • Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
    • So some submitters don’t know to use the controlled vocabulary lookup
    • Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder
    • CGSpace users were having problems logging in, I think something’s wrong with LDAP because I see this in the logs:
    2018-01-15 12:53:15,810 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
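• For what it’s worth, data 775 in that kind of Active Directory response normally means the account is locked out, and a manual bind with ldapsearch would confirm whether it’s the service account (a sketch with placeholder host, bind DN, and username):
$ ldapsearch -x -H ldaps://ldap.example.org:636 -D 'cgspace-ldap-bind@example.org' -W -b 'dc=example,dc=org' '(sAMAccountName=someuser)' dn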
     
      @@ -835,7 +835,7 @@ sys 0m2.210s
      • Meeting with CGSpace team, a few action items:
        • Discuss standardized names for CRPs and centers with ICARDA (don’t wait for CG Core)
        • Re-send DC rights implementation and forward to everyone so we can move forward with it (without the URI field for now)
        • Start looking at where I was with the AGROVOC API
        • Have a controlled vocabulary for CGIAR authors’ names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)
        • @@ -845,15 +845,15 @@ sys 0m2.210s
        • Add Sisay and Danny to Uptime Robot and allow them to restart Tomcat on CGSpace ✔
• I removed Tsega’s SSH access to the web and DSpace servers, and asked Danny to check whether there is anything he needs from Tsega’s home directories so we can delete the accounts completely
      • I removed Tsega’s access to Linode dashboard as well
      • I ended up creating a Jira issue for my db.jndi documentation fix: DS-3803
      • The DSpace developers said they wanted each pull request to be associated with a Jira issue

      2018-01-17

      • Abenet asked me to proof and upload 54 records for LIVES
      • A few records were missing countries (even though they’re all from Ethiopia)
      • Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses
      • In any case, importing them like this:
      @@ -862,7 +862,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
    • And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload
    • When I looked there were 210 PostgreSQL connections!
    • I don’t see any high load in XMLUI or REST/OAI:
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         381 40.77.167.124
    @@ -892,8 +892,8 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
     
    2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8081: The target server failed to respond
     2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
     
• I have NEVER seen this error before, and there is no error before or after that in DSpace’s solr.log
    • Tomcat’s catalina.out does show something interesting, though, right at that time:
    [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
    @@ -933,7 +933,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOf
             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     
    • You can see the timestamp above, which is some Atmire nightly task I think, but I can’t figure out which one
    • So I restarted Tomcat and tried the import again, which finished very quickly and without errors!
    $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &> lives2.log
    @@ -942,7 +942,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOf
     
     

    Tomcat JVM Heap

    $ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     $ docker volume create --name artifactory5_data
    @@ -961,10 +961,10 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
     
    $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
     
    • UptimeRobot said CGSpace went down for a few minutes
• I didn’t do anything but it came back up on its own
    • I don’t see anything unusual in the XMLUI or REST/OAI logs
    • Now Linode alert says the CPU load is high, sigh
    • Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I’m not sure how far these logs go back, as they are not strictly daily):
    # zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
     /var/log/tomcat7/catalina.out:2
    @@ -994,14 +994,14 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
     

    2018-01-18

    • UptimeRobot said CGSpace was down for 1 minute last night
    • I don’t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499
    • I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
     
    • Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection
    • It’s easy enough to do in OpenRefine, but you have to be careful to only get those items that are uploaded into Bioversity’s collection, not the ones that are mapped from others!
    • Use this GREL in OpenRefine after isolating all the Limited Access items: value.startsWith("10568/35501")
    • UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me or each other!
    @@ -1011,8 +1011,8 @@ Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat.... Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7 Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for user root by swebshet(uid=0)
• I had to cancel the Discovery indexing and I’ll have to re-try it another time when the server isn’t so busy (it had already taken two hours and wasn’t even close to being done)
    • For now I’ve increased the Tomcat JVM heap from 5632 to 6144m, to give ~1GB of free memory over the average usage to hopefully account for spikes caused by load or background jobs

    2018-01-19

      @@ -1023,8 +1023,8 @@ Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
    • Linode alerted again and said that CGSpace was using 301% CPU
• Peter emailed to ask why this item doesn’t have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard
    • Looks like our badge code calls the handle endpoint which doesn’t exist:
    https://api.altmetric.com/v1/handle/10568/88090
     
      @@ -1060,7 +1060,7 @@ real 7m2.241s user 1m33.198s sys 0m12.317s
    • I tested the abstract cleanups on Bioversity’s Journal Articles collection again that I had started a few days ago
    • In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts
    • I want to document the workflow of adding a production PostgreSQL database to a development instance of DSpace in Docker:
    @@ -1075,7 +1075,7 @@ $ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db: $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
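• Roughly, the steps before the update-sequences part above would look like this (a sketch; the dump file name, postgres superuser, and dspace role are assumptions, while the dspace_db container name comes from the commands above):
$ docker cp ~/Downloads/cgspace_db.backup dspace_db:/tmp/cgspace_db.backup
$ docker exec dspace_db dropdb -U postgres dspace
$ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
$ docker exec dspace_db pg_restore -U postgres -d dspace -O --role=dspace /tmp/cgspace_db.backup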

    2018-01-22

    • Look over Udana’s CSV of 25 WLE records from last week
    • I sent him some corrections:
      • The file encoding is Windows-1252
      • @@ -1090,7 +1090,7 @@ $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
      • I wrote a quick Python script to use the DSpace REST API to find all collections under a given community
      • The source code is here: rest-find-collections.py
• Peter had said that he found a bunch of ILRI collections that were called “untitled”, but I don’t see any:
      $ ./rest-find-collections.py 10568/1 | wc -l
       308
      @@ -1099,17 +1099,17 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
       
    • Looking at the Tomcat connector docs I think we really need to increase maxThreads
• The default is 200, which can easily be taken up by bots, considering that Google and Bing sometimes browse with fifty (50) connections each!
    • Before I increase this I want to see if I can measure and graph this, and then benchmark
    • I’ll probably also increase minSpareThreads to 20 (its default is 10)
    • I still want to bump up acceptorThreadCount from 1 to 2 as well, as the documentation says this should be increased on multi-core systems
    • I spent quite a bit of time looking at jvisualvm and jconsole today
    • Run system updates on DSpace Test and reboot it
    • I see I can monitor the number of Tomcat threads and some detailed JVM memory stuff if I install munin-plugins-java
• I’d still like to get arbitrary mbeans like activeSessions etc, though
    • I can’t remember if I had to configure the jmx settings in /etc/munin/plugin-conf.d/munin-node or not—I think all I did was re-run the munin-node-configure script and of course enable JMX in Tomcat’s JVM options
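• For posterity, getting those plugins onto the server is roughly (a sketch; the package name is from above, the rest is the stock munin-node workflow):
# apt install munin-plugins-java
# munin-node-configure --shell | sh
# systemctl restart munin-node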

    2018-01-23

    • Thinking about generating a jmeter test plan for DSpace, along the lines of Georgetown’s dspace-performance-test
    • I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:
    # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
    @@ -1208,7 +1208,7 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
     
    $ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
     

    2018-01-25

    • Run another round of tests on DSpace Test with jmeter after changing Tomcat’s minSpareThreads to 20 (default is 10) and acceptorThreadCount to 2 (default is 1):
    $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
    @@ -1221,18 +1221,18 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
     
    • I haven’t had time to look at the results yet

    2018-01-26

    • Peter followed up about some of the points from the Skype meeting last week
    • Regarding the ORCID field issue, I see ICARDA’s MELSpace is using cg.creator.ID: 0000-0001-9156-7691
    • I had floated the idea of using a controlled vocabulary with values formatted something like: Orth, Alan S. (0000-0002-1735-7458)
    • Update PostgreSQL JDBC driver version from 42.1.4 to 42.2.1 on DSpace Test, see: https://jdbc.postgresql.org/
    • Reboot DSpace Test to get new Linode kernel (Linux 4.14.14-x86_64-linode94)
    • I am testing my old work on the dc.rights field, I had added a branch for it a few months ago
    • I added a list of Creative Commons and other licenses in input-forms.xml
    • The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn’t possible (?)
    • So I used some creativity and made several fields display values, but not store any, ie:
    <pair>
    @@ -1240,7 +1240,7 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
       <stored-value></stored-value>
     </pair>
     
    • I was worried that if a user selected this field for some reason that DSpace would store an empty value, but it simply doesn’t register that as a valid option:

    Rights

      @@ -1286,9 +1286,9 @@ Was expecting one of: Maximum: 2771268 Average: 210483
• I guess responses that don’t fit in RAM get saved to disk (a default of 1024M), so this is definitely not the issue here, and that warning is totally unrelated
• My best guess is that the Solr search error is related somehow but I can’t figure it out
    • We definitely have enough database connections, as I haven’t seen a pool error in weeks:
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
     dspace.log.2018-01-20:0
    @@ -1305,7 +1305,7 @@ dspace.log.2018-01-29:0
     
  • Adam Hunt from WLE complained that pages take “1-2 minutes” to load each, from France and Sri Lanka
  • I asked him which particular pages, as right now pages load in 2 or 3 seconds for me
  • UptimeRobot said CGSpace went down again, and I looked at PostgreSQL and saw 211 active database connections
  • If it’s not memory and it’s not database, it’s gotta be Tomcat threads, seeing as the default maxThreads is 200 anyways, it actually makes sense
  • I decided to change the Tomcat thread settings on CGSpace:
    • maxThreads from 200 (default) to 400
    • @@ -1333,8 +1333,8 @@ busy.value 0 idle.value 20 max.value 400
• Apparently you can’t monitor more than one connector, so I guess the most important to monitor would be the one that nginx is sending stuff to
    • So for now I think I’ll just monitor these and skip trying to configure the jmx plugins
    • Although following the logic of /usr/share/munin/plugins/jmx_tomcat_dbpools could be useful for getting the active Tomcat sessions
    • From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:
    @@ -1343,7 +1343,7 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"
    [===================>                               ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
     
    @@ -1411,18 +1411,18 @@ javax.ws.rs.WebApplicationException

    CPU usage week

    # port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
     Catalina:type=Manager,context=/,host=localhost  activeSessions  8
     

    MBeans in JVisualVM

diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html
index 4adbe47d5..f39111521 100644
--- a/docs/2018-02/index.html
+++ b/docs/2018-02/index.html

    February, 2018

    @@ -112,9 +112,9 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plug

    2018-02-01

    • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
    • We don’t need to distinguish between internal and external works, so that makes it just a simple list
    • Yesterday I figured out how to monitor DSpace sessions using JMX
    • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

    DSpace Sessions

      @@ -163,7 +163,7 @@ sys 0m1.905s
      dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
       UPDATE 20
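• A quick way to confirm nothing is left over is to count matches for the same pattern (a sketch re-using the field IDs from the update above):
$ psql dspace -c "select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '\s+$';"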
       
      • I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away
      • This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
      • Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:
      @@ -200,10 +200,10 @@ Tue Feb 6 09:30:32 UTC 2018 295 197.210.168.174 752 144.76.64.79
      • I did notice in /var/log/tomcat7/catalina.out that Atmire’s update thing was running though
      • So I restarted Tomcat and now everything is fine
      • Next time I see that many database connections I need to save the output so I can analyze it later
      • I’m going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning
• If I move the task from 3AM to 3PM, ideally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
• Atmire has said that there will eventually be a fix for this high load caused by their script, but it will come with the 5.8 compatibility they are already working on
      • I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
      • @@ -211,16 +211,16 @@ Tue Feb 6 09:30:32 UTC 2018
      • I implemented some changes to the pooling in the Ansible infrastructure scripts so that each DSpace web application can use its own pool (web, api, and solr)
      • Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down
      • Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
      • I’m not actually sure if the Solr web application uses the database though, so I’ll have to check later and remove it if necessary
      • I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week

      2018-02-07

      • Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace
      • I tried on DSpace Test and indeed the lookup just doesn’t work!
      • The ORCiD code in DSpace appears to be using http://pub.orcid.org/, but when I go there in the browser it redirects me to https://pub.orcid.org/v2.0/
      • According to the announcement the v1 API was moved from http://pub.orcid.org/ to https://pub.orcid.org/v1.2 until March 1st when it will be discontinued for good
      • But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
      • So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
      • CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
      • I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:
      • @@ -249,7 +249,7 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity* 1828
        • CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
        • What’s interesting is that the DSpace log says the connections are all busy:
        org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
         
          @@ -263,14 +263,14 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle 187
          • What the fuck, does DSpace think all connections are busy?
• I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the removeAbandoned='true' parameter which is apparently off by default
          • I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws
          • Also, WTF, there was a heap space error randomly in catalina.out:
          Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
           Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
           
          • I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
          • Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:
          $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
          @@ -319,20 +319,20 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
           992
           
           
          • Let’s investigate who these IPs belong to:
            • 104.196.152.243 is CIAT, which is already marked as a bot via nginx!
• 207.46.13.71 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
• 40.77.167.62 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
• 207.46.13.135 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
• 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
• 40.77.167.36 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
• 207.46.13.54 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
            • 46.229.168.x is Semrush, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
• Nice, so these are all known bots that are already crammed into one session by Tomcat’s Crawler Session Manager Valve.
          • What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle
          • One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:
          BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
          @@ -343,7 +343,7 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
           /var/log/nginx/access.log:1925
           /var/log/nginx/access.log.1:2029
           
          • And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww
          • Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace
          • Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker
          • This is how the connections looked when it crashed this afternoon:
          • @@ -359,16 +359,16 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | 5 dspaceWeb
            • So is this just some fucked up XMLUI database leaking?
• I notice there is an issue (that I’ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
• I seriously doubt this leaking shit is fixed for sure, but I’m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I’m fed up with this shit
            • I cherry-picked all the commits for DS-3551 but it won’t build on our current DSpace 5.5!
            • I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle

            2018-02-10

            • I tried to disable ORCID lookups but keep the existing authorities
            • This item has an ORCID for Ralf Kiese: http://localhost:8080/handle/10568/89897
            • Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
            • Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
            Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
            @@ -377,7 +377,7 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
             
          xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
           
          • So I don’t think we can disable the ORCID lookup function and keep the ORCID badges

          2018-02-11

            @@ -409,7 +409,7 @@ authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between
            $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
             $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
             
            @@ -440,7 +440,7 @@ dspace=# commit;
          • I wrote a Python script (resolve-orcids-from-solr.py) using SolrClient to parse the Solr authority cache for ORCID IDs
          • We currently have 1562 authority records with ORCID IDs, and 624 unique IDs
          • We can use this to build a controlled vocabulary of ORCID IDs for new item submissions
          • I don’t know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?
          • I added the script to the ILRI DSpace wiki on GitHub
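• The Solr side of resolve-orcids-from-solr.py boils down to a query for orcid_id:* on the authority core, something like (a sketch; the port, core name, and field list are assumptions based on our setup):
$ curl -s 'http://localhost:8081/solr/authority/select?q=orcid_id:*&fl=id,orcid_id&wt=json&rows=10000'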

          2018-02-12

          @@ -448,21 +448,21 @@ dspace=# commit;
        • Follow up with Atmire on the DSpace 5.8 Compatibility ticket to ask again if they want me to send them a DSpace 5.8 branch to work on
        • Abenet asked if there was a way to get the number of submissions she and Bizuwork did
        • I said that the Atmire Workflow Statistics module was supposed to be able to do that
• We had tried it in June, 2017 and found that it didn’t work
        • Atmire sent us some fixes but they didn’t work either
        • I just tried the branch with the fixes again and it indeed does not work:

        Atmire Workflow Statistics No Data Available

        • I see that in April, 2017 I just used a SQL query to get a user’s submissions by checking the dc.description.provenance field
        • So for Abenet, I can check her submissions in December, 2017 with:
        dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
         
        • I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it
        • This would be using Linode’s new block storage volumes
        • I think our current $40/month Linode has enough CPU and memory capacity, but we need more disk space
        • I think I’d probably just attach the block storage volume and mount it on /home/dspace
        • Ask Peter about dc.rights on DSpace Test again, if he likes it then we should move it to CGSpace soon

        2018-02-13

        @@ -492,16 +492,16 @@ dspace.log.2018-02-11:3 dspace.log.2018-02-12:0 dspace.log.2018-02-13:4
        • I apparently added that on 2018-02-07 so it could be, as I don’t see any of those socket closed errors in 2018-01’s logs!
        • I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned
        • Peter hit this issue one more time, and this is apparently what Tomcat’s catalina.out log says when an abandoned connection is removed:
        Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
         WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
         

        2018-02-14

        • Skype with Peter and the Addis team to discuss what we need to do for the ORCIDs in the immediate future
        • We said we’d start with a controlled vocabulary for cg.creator.id on the DSpace Test submission form, where we store the author name and the ORCID in some format like: Alan S. Orth (0000-0002-1735-7458)
        • Eventually we need to find a way to print the author names with links to their ORCID profiles
        • Abenet will send an email to the partners to give us ORCID IDs for their authors and to stress that they update their name format on ORCID.org if they want it in a special way
        • I sent the Codeobia guys a question to ask how they prefer that we store the IDs, ie one of: @@ -539,14 +539,14 @@ $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_c
          $ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
           1227
           
          • There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done
          • The dspace cleanup -v currently fails on CGSpace with the following:
           - Deleting bitstream record from database (ID: 149473)
           Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
             Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
           
          • The solution is to update the bitstream table, as I’ve discovered several other times in 2016 and 2017:
          $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
           UPDATE 1
          @@ -561,7 +561,7 @@ UPDATE 1
           
        • See the corresponding page on Altmetric: https://www.altmetric.com/details/handle/10568/78450
      • And this item doesn’t even exist on CGSpace!
      • Start working on XMLUI item display code for ORCIDs
      • Send emails to Macaroni Bros and Usman at CIFOR about ORCID metadata
      • CGSpace crashed while I was driving to Tel Aviv, and was down for four hours!
      • @@ -573,7 +573,7 @@ UPDATE 1 1 dspaceWeb 3 dspaceApi
        • I see shitloads of memory errors in Tomcat’s logs:
        # grep -c "Java heap space" /var/log/tomcat7/catalina.out
         56
        @@ -607,13 +607,13 @@ UPDATE 1
         UPDATE 2
         

        2018-02-18

        • ICARDA’s Mohamed Salem pointed out that it would be easiest to format the cg.creator.id field like “Alan Orth: 0000-0002-1735-7458” because no name will have a “:” so it’s easier to split on
        • I finally figured out a few ways to extract ORCID iDs from metadata using XSLT and display them in the XMLUI:

        Displaying ORCID iDs in XMLUI

• The one on the bottom left uses a similar format to our author display, and the one in the middle uses the format recommended by ORCID’s branding guidelines
        • Also, I realized that the Academicons font icon set we’re using includes an ORCID badge so we don’t need to use the PNG image anymore
        • Run system updates on DSpace Test (linode02) and reboot the server
        • Looking back at the system errors on 2018-02-15, I wonder what the fuck caused this:
        @@ -629,13 +629,13 @@ UPDATE 2 167432 dspace.log.2018-02-18
        • From an average of a few hundred thousand to over four million lines in DSpace log?
        • Using grep’s -B1 I can see the line before the heap space error, which has the time, ie:
        2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
         org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
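• For reference, that grep is roughly (a sketch; the file is presumably that day’s dspace log):
$ grep -B1 'java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-15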
         
        • So these errors happened at hours 16, 18, 19, and 20
        • Let’s see what was going on in nginx then:
        # zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
         168571
        @@ -693,7 +693,7 @@ Traceback (most recent call last):
             family_name = data['name']['family-name']['value']
         TypeError: 'NoneType' object is not subscriptable
         
        • According to ORCID that identifier’s family-name is null so that sucks
        • I fixed the script so that it checks if the family name is null
        • Now another:
        @@ -707,19 +707,19 @@ Traceback (most recent call last): if data['name']['given-names']: TypeError: 'NoneType' object is not subscriptable
        • According to ORCID that identifier’s entire name block is null!

        2018-02-20

        • Send Abenet an email about getting a purchase requisition for a new DSpace Test server on Linode
• Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet and I think we’ll now only use ORCID iDs that have been sent to us from partners, not those extracted via keyword searches on orcid.org
        • This should be the version we use (the existing controlled vocabulary generated from CGSpace’s Solr authority core plus the IDs sent to us so far by partners):
        $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
         
        • I updated the resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name”
• Also, I added color-coded output to the debug messages and added a “quiet” mode that suppresses the normal behavior of printing results to the screen
        • I’m using this as the test input for resolve-orcids.py:
        $ cat orcid-test-values.txt 
         # valid identifier with 'given-names' and 'family-name'
        @@ -753,13 +753,13 @@ TypeError: 'NoneType' object is not subscriptable
         
      • The Altmetric JavaScript builds the following API call: https://api.altmetric.com/v1/handle/10568/83320?callback=_altmetric.embed_callback&domain=cgspace.cgiar.org&key=3c130976ca2b8f2e88f8377633751ba1&cache_until=13-20
      • The response body is not JSON
      • To contrast, the following bare API call without query parameters is valid JSON: https://api.altmetric.com/v1/handle/10568/83320
      • -
      • I told them that it's their JavaScript that is fucked up
      • +
      • I told them that it’s their JavaScript that is fucked up
      • Remove CPWF project number and Humidtropics subject from submission form (#3)
      • I accidentally merged it into my own repository, oops

      2018-02-22

        -
      • CGSpace was apparently down today around 13:00 server time and I didn't get any emails on my phone, but saw them later on the computer
      • +
      • CGSpace was apparently down today around 13:00 server time and I didn’t get any emails on my phone, but saw them later on the computer
      • It looks like Sisay restarted Tomcat because I was offline
      • There was absolutely nothing interesting going on at 13:00 on the server, WTF?
      @@ -789,7 +789,7 @@ TypeError: 'NoneType' object is not subscriptable
         5208 5.9.6.51
         8686 45.5.184.196
        -
      • So I don't see any definite cause for this crash, I see a shit ton of abandoned PostgreSQL connections today around 1PM!
      • +
      • So I don’t see any definite cause for this crash, I see a shit ton of abandoned PostgreSQL connections today around 1PM!
      # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
       729
      @@ -821,14 +821,14 @@ TypeError: 'NoneType' object is not subscriptable
       
      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
       1004
       
        -
      • I will add them to DSpace Test but Abenet says she's still waiting to set us ILRI's list
      • +
      • I will add them to DSpace Test, but Abenet says she’s still waiting to send us ILRI’s list
      • I will tell her that we should proceed with sharing our work on DSpace Test with the partners this week anyway, and we can update the list later
      • While regenerating the names for these ORCID identifiers I saw one that has a weird value for its names:
      Looking up the names associated with ORCID iD: 0000-0002-2614-426X
       Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
       
        -
      • I don't know if the user accidentally entered this as their name or if that's how ORCID behaves when the name is private?
      • +
      • I don’t know if the user accidentally entered this as their name or if that’s how ORCID behaves when the name is private?
      • I will remove that one from our list for now
      • Remove Dryland Systems subject from submission form because that CRP closed two years ago (#355)
      • Run all system updates on DSpace Test
      @@ -842,7 +842,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
       62464
      (1 row)
        -
      • I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it's way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
      • +
      • I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it’s way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
      • The query in Solr would simply be orcid_id:*
      • Assuming I know that authority record with id:d7ef744b-bbd4-4171-b449-00e37e1b776f, then I could query PostgreSQL for all metadata records using that authority:
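      • A rough Python sketch of that two-step mapping (not the exact queries from these notes); the Solr URL, core name, and metadata_field_id are assumptions/placeholders for this instance:
      #!/usr/bin/env python3
      # Hedged sketch: fetch ORCID iDs and their authority IDs from the Solr
      # authority core, then look up author metadata rows in PostgreSQL by authority.
      # Core name, field IDs, and connection details are placeholders.
      import psycopg2
      import requests

      solr = requests.get('http://localhost:8080/solr/authority/select',
                          params={'q': 'orcid_id:*', 'fl': 'id,orcid_id',
                                  'wt': 'json', 'rows': 10000}).json()

      conn = psycopg2.connect('dbname=dspace user=dspace host=localhost')
      cursor = conn.cursor()
      for doc in solr['response']['docs']:
          # metadata_field_id 3 assumed to be dc.contributor.author on this instance
          cursor.execute("""SELECT resource_id, text_value FROM metadatavalue
                            WHERE resource_type_id=2 AND metadata_field_id=3
                            AND authority=%s""", (doc['id'],))
          for resource_id, author in cursor.fetchall():
              print(doc['orcid_id'], author, resource_id)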
      @@ -877,14 +877,14 @@ Nor Azwadi: 0000-0001-9634-1958
      • Peter is having problems with “Socket closed” on his submissions page again
      • He says his personal account loads much faster than his CGIAR account, which could be because the CGIAR account has potentially thousands of submissions over the last few years
      • -
      • I don't know why it would take so long, but this logic kinda makes sense
      • +
      • I don’t know why it would take so long, but this logic kinda makes sense
      • I think I should increase the removeAbandonedTimeout from 90 to something like 180 and continue observing
      • I also reduced the timeout for the API pool back to 60 because those interfaces are only used by bots

      2018-02-27

      • Peter is still having problems with “Socket closed” on his submissions page
      • -
      • I have disabled removeAbandoned for now because that's the only thing I changed in the last few weeks since he started having issues
      • +
      • I have disabled removeAbandoned for now because that’s the only thing I changed in the last few weeks since he started having issues
      • I think the real line of logic to follow here is why the submissions page is so slow for him (presumably because of loading all his submissions?)
      • I need to see which SQL queries are run during that time
      • And only a few hours after I disabled the removeAbandoned thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:
      @@ -895,7 +895,7 @@ Nor Azwadi: 0000-0001-9634-1958
      $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
      218
          -
        • So I'm re-enabling the removeAbandoned setting
        • +
        • So I’m re-enabling the removeAbandoned setting
        • I grabbed a snapshot of the active connections in pg_stat_activity for all queries running longer than 2 minutes:
        dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
        @@ -926,8 +926,8 @@ COPY 263
         

      2018-02-28

        -
      • CGSpace crashed today, the first HTTP 499 in nginx's access.log was around 09:12
      • -
      • There's nothing interesting going on in nginx's logs around that time:
      • +
      • CGSpace crashed today, the first HTTP 499 in nginx’s access.log was around 09:12
      • +
      • There’s nothing interesting going on in nginx’s logs around that time:
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
            65 197.210.168.174
      @@ -995,8 +995,8 @@ dspace.log.2018-02-28:1
       
    • According to the log 01D9932D6E85E90C2BA9FF5563A76D03 is an ILRI editor, doing lots of updating and editing of items
    • 8100883DAD00666A655AE8EC571C95AE is some Indian IP address
    • 1E9834E918A550C5CD480076BC1B73A4 looks to be a session shared by the bots
    • -
    • So maybe it was due to the editor's uploading of files, perhaps something that was too big or?
    • -
    • I think I'll increase the JVM heap size on CGSpace from 6144m to 8192m because I'm sick of this random crashing shit and the server has memory and I'd rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
    • +
    • So maybe it was due to the editor’s uploading of files, perhaps something that was too big or?
    • +
    • I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit and the server has memory and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
    • Run the few corrections from earlier this month for sponsor on CGSpace:
    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
    diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html
    index 7882881d0..8ee63ebc8 100644
    --- a/docs/2018-03/index.html
    +++ b/docs/2018-03/index.html
    @@ -21,7 +21,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
     
     Export a CSV of the IITA community metadata for Martin Mueller
     "/>
    -
    +
     
     
         
    @@ -51,7 +51,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
         
         
         
    -    
    +    
         
     
         
    @@ -98,7 +98,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
       

    March, 2018

    @@ -143,7 +143,7 @@ UPDATE 1
    • Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (#360)
    • Help Sisay proof 200 IITA records on DSpace Test
    • -
    • Finally import Udana's 24 items to IWMI Journal Articles on CGSpace
    • +
    • Finally import Udana’s 24 items to IWMI Journal Articles on CGSpace
    • Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc

    2018-03-08

    @@ -189,14 +189,14 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id es (9 rows)
      -
    • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that's probably why there are over 100,000 fields changed…
    • +
    • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…
    • If I skip that, there are about 2,000, which seems more reasonably like the amount of fields users have edited manually, or fucked up during CSV import, etc:
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
     UPDATE 2309
     
    • I will apply this on CGSpace right now
    • -
    • In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine
    • +
    • In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine
    • Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field
    • For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
    @@ -206,7 +206,7 @@ UPDATE 2309
    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
     
      -
    • One thing that bothers me is that this won't honor author order
    • +
    • One thing that bothers me is that this won’t honor author order
    • It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id
    • I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching cg.creator.id fields: add-orcid-identifiers-csv.py
    • The CSV should have two columns: author name and ORCID identifier:
    @@ -215,13 +215,13 @@ UPDATE 2309
    "Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
    "Orth, A.",Alan S. Orth: 0000-0002-1735-7458
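    • For illustration, a minimal Python sketch of that idea (not the actual add-orcid-identifiers-csv.py): look up each author name variation and report the item and place that a new cg.creator.id row would need to copy; the field ID, file name, and connection details are assumptions:
    #!/usr/bin/env python3
    # Hedged sketch of CSV-driven ORCID tagging that honors author order by
    # reading the existing "place" of each matching dc.contributor.author row.
    import csv
    import psycopg2

    AUTHOR_FIELD_ID = 3  # dc.contributor.author on this instance (assumed)

    conn = psycopg2.connect('dbname=dspace user=dspace host=localhost')
    cursor = conn.cursor()

    with open('orcid-ids.csv') as f:  # hypothetical file name
        for name, orcid in csv.reader(f):
            cursor.execute("""SELECT resource_id, place FROM metadatavalue
                              WHERE resource_type_id=2 AND metadata_field_id=%s
                              AND text_value=%s""", (AUTHOR_FIELD_ID, name))
            for resource_id, place in cursor.fetchall():
                print(f'item {resource_id}: cg.creator.id "{orcid}" at place {place}')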
        -
      • I didn't integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
      • -
      • I added ORCID identifers for 187 items by CIAT's Hernan Ceballos, because that is what Elizabeth was trying to do manually!
      • +
      • I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
      • +
      • I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!
      • Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well

      2018-03-09

        -
      • Give James Stapleton input on Sisay's KRAs
      • +
      • Give James Stapleton input on Sisay’s KRAs
      • Create a pull request to disable ORCID authority integration for dc.contributor.author in the submission forms and XMLUI display (#363)

      2018-03-11

      @@ -240,12 +240,12 @@ g/jspui/listings-and-reports org.apache.jasper.JasperException: java.lang.NullPointerException
        -
      • Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn't find them
      • -
      • I made a quick fix and it's working now (#364)
      • +
      • Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn’t find them
      • +
      • I made a quick fix and it’s working now (#364)

      2018-03-12

        -
      • Increase upload size on CGSpace's nginx config to 85MB so Sisay can upload some data
      • +
      • Increase upload size on CGSpace’s nginx config to 85MB so Sisay can upload some data

      2018-03-13

        @@ -269,7 +269,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException

        2018-03-15

        • Help Abenet troubleshoot the Listings and Reports issue again
        • -
        • It looks like it's an issue with the layouts, if you create a new layout that only has one type (dc.identifier.citation):
        • +
        • It looks like it’s an issue with the layouts, if you create a new layout that only has one type (dc.identifier.citation):

        Listing and Reports layout

          @@ -286,7 +286,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
          • ICT made the DNS updates for dspacetest.cgiar.org late last night
          • I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164
          • -
          • Looking at the CRP subjects on CGSpace I see there is one blank one so I'll just fix it:
          • +
          • Looking at the CRP subjects on CGSpace I see there is one blank one so I’ll just fix it:
          dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
           
            @@ -305,7 +305,7 @@ COPY 21
            • Tezira has been having problems accessing CGSpace from the ILRI Nairobi campus since last week
            • She is getting an HTTPS error apparently
            • -
            • It's working outside, and Ethiopian users seem to be having no issues so I've asked ICT to have a look
            • +
            • It’s working outside, and Ethiopian users seem to be having no issues so I’ve asked ICT to have a look
            • CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat
            • Around that time there were an increase of SQL errors:
            @@ -313,7 +313,7 @@ COPY 21
            ...
            2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
              -
            • But these errors, I don't even know what they mean, because a handful of them happen every day:
            • +
            • But these errors, I don’t even know what they mean, because a handful of them happen every day:
            $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
             dspace.log.2018-03-10:13
            @@ -327,7 +327,7 @@ dspace.log.2018-03-17:13
             dspace.log.2018-03-18:15
             dspace.log.2018-03-19:90
             
              -
            • There wasn't even a lot of traffic at the time (8–9 AM):
            • +
            • There wasn’t even a lot of traffic at the time (8–9 AM):
            # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                  92 40.77.167.197
            @@ -341,7 +341,7 @@ dspace.log.2018-03-19:90
                 207 104.196.152.243
                 294 54.198.169.202
             
              -
            • Well there is a hint in Tomcat's catalina.out:
            • +
            • Well there is a hint in Tomcat’s catalina.out:
            Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
             Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
            @@ -354,7 +354,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOf
             
          • Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export
          • It appears to be this one: https://cgspace.cgiar.org/handle/10568/83473?show=full
          • The title is “Untitled” and there is some metadata but indeed the citation is missing
          • -
          • I don't know what would cause that
          • +
          • I don’t know what would cause that

          2018-03-20

            @@ -367,7 +367,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
            • I have no idea why it crashed
            • I ran all system updates and rebooted it
            • -
            • Abenet told me that one of Lance Robinson's ORCID iDs on CGSpace is incorrect
            • +
            • Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect
            • I will remove it from the controlled vocabulary (#367) and update any items using the old one:
            dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
            @@ -406,7 +406,7 @@ java.lang.IllegalArgumentException: No choices plugin was configured for  field
             
            • Looks like the indexing gets confused that there is still data in the authority column
            • Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!
            • -
            • Since we've migrated the ORCID identifiers associated with the authority data to the cg.creator.id field we can nullify the authorities remaining in the database:
            • +
            • Since we’ve migrated the ORCID identifiers associated with the authority data to the cg.creator.id field we can nullify the authorities remaining in the database:
            dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
             UPDATE 195463
            @@ -417,8 +417,8 @@ java.lang.IllegalArgumentException: No choices plugin was configured for  field
             
            dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
             COPY 56156
             
              -
            • Afterwards we'll want to do some batch tagging of ORCID identifiers to these names
            • -
            • CGSpace crashed again this afternoon, I'm not sure of the cause but there are a lot of SQL errors in the DSpace log:
            • +
            • Afterwards we’ll want to do some batch tagging of ORCID identifiers to these names
            • +
            • CGSpace crashed again this afternoon, I’m not sure of the cause but there are a lot of SQL errors in the DSpace log:
            2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
             java.sql.SQLException: Connection has already been closed.
            @@ -444,11 +444,11 @@ java.lang.OutOfMemoryError: Java heap space
             
            # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
             319
             
              -
            • I guess we need to give it more RAM because it now has CGSpace's large Solr core
            • +
            • I guess we need to give it more RAM because it now has CGSpace’s large Solr core
            • I will increase the memory from 3072m to 4096m
            • Update Ansible playbooks to use PostgreSQL JBDC driver 42.2.2
            • Deploy the new JDBC driver on DSpace Test
            • -
            • I'm also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode's new block storage volumes
            • +
            • I’m also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode’s new block storage volumes
            $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
             
            @@ -456,9 +456,9 @@ real    208m19.155s
             user    8m39.138s
             sys     2m45.135s
             
              -
            • So that's about three times as long as it took on CGSpace this morning
            • +
            • So that’s about three times as long as it took on CGSpace this morning
            • I should also check the raw read speed with hdparm -tT /dev/sdc
            • -
            • Looking at Peter's author corrections there are some mistakes due to Windows 1252 encoding
            • +
            • Looking at Peter’s author corrections there are some mistakes due to Windows 1252 encoding
            • I need to find a way to filter these easily with OpenRefine
            • For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields
            • I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:
            • @@ -475,16 +475,16 @@ sys 2m45.135s

              2018-03-24

              • More work on the Ubuntu 18.04 readiness stuff for the Ansible playbooks
              • -
              • The playbook now uses the system's Ruby and Node.js so I don't have to manually install RVM and NVM after
              • +
              • The playbook now uses the system’s Ruby and Node.js so I don’t have to manually install RVM and NVM after

              2018-03-25

                -
              • Looking at Peter's author corrections and trying to work out a way to find errors in OpenRefine easily
              • +
              • Looking at Peter’s author corrections and trying to work out a way to find errors in OpenRefine easily
              • I can find all names that have acceptable characters using a GREL expression like:
              isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
               
                -
              • But it's probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):
              • +
              • But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):
              or(
                 isNotNull(value.match(/.*[(|)].*/)),
              @@ -493,7 +493,7 @@ sys     2m45.135s
                 isNotNull(value.match(/.*\u200A.*/))
               )
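              • As a cross-check outside OpenRefine, a small hypothetical Python snippet that flags the same suspicious characters (parentheses, pipe, U+FFFD, U+200A) in a CSV of author values; the file and column names are made up:
              #!/usr/bin/env python3
              # Hypothetical cross-check: print author values containing the characters
              # that the GREL expression above looks for.
              import csv
              import re

              suspicious = re.compile('[(|)\ufffd\u200a]')

              with open('authors.csv', newline='') as f:   # hypothetical file
                  for row in csv.DictReader(f):
                      value = row['correct']               # hypothetical column name
                      if suspicious.search(value):
                          print(repr(value))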
               
                -
              • And here's one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it's time to add delete support to my fix-metadata-values.py script:
              • +
              • And here’s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my fix-metadata-values.py script):
              or(
                 isNotNull(value.match(/.*delete.*/i)),
              @@ -523,21 +523,21 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
               

            2018-03-26

              -
            • Atmire got back to me about the Listings and Reports issue and said it's caused by items that have missing dc.identifier.citation fields
            • +
            • Atmire got back to me about the Listings and Reports issue and said it’s caused by items that have missing dc.identifier.citation fields
            • They will send a fix

            2018-03-27

              -
            • Atmire got back with an updated quote about the DSpace 5.8 compatibility so I've forwarded it to Peter
            • +
            • Atmire got back with an updated quote about the DSpace 5.8 compatibility so I’ve forwarded it to Peter

            2018-03-28

              -
            • DSpace Test crashed due to heap space so I've increased it from 4096m to 5120m
            • -
            • The error in Tomcat's catalina.out was:
            • +
            • DSpace Test crashed due to heap space so I’ve increased it from 4096m to 5120m
            • +
            • The error in Tomcat’s catalina.out was:
            Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
             
              -
            • Add ISI Journal (cg.isijournal) as an option in Atmire's Listing and Reports layout (#370) for Abenet
            • +
            • Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (#370) for Abenet
            • I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:
            $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
            @@ -552,7 +552,7 @@ Fixed 28 occurences of: GRAIN LEGUMES
             Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
             Fixed 5 occurences of: GENEBANKS
             
              -
            • That's weird because we just updated them last week…
            • +
            • That’s weird because we just updated them last week…
            • Create a pull request to enable searching by ORCID identifier (cg.creator.id) in Discovery and Listings and Reports (#371)
            • I will test it on DSpace Test first!
            • Fix one missing XMLUI string for “Access Status” (cg.identifier.status)
            diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html
            index 406261bc8..cc6f5f948 100644
            --- a/docs/2018-04/index.html
            +++ b/docs/2018-04/index.html
            @@ -8,7 +8,7 @@
            @@ -20,10 +20,10 @@ Catalina logs at least show some memory errors yesterday:
            @@ -53,7 +53,7 @@ Catalina logs at least show some memory errors yesterday:
            @@ -100,14 +100,14 @@ Catalina logs at least show some memory errors yesterday:

              April, 2018

              2018-04-01

                -
              • I tried to test something on DSpace Test but noticed that it's down since god knows when
              • +
              • I tried to test something on DSpace Test but noticed that it’s down since god knows when
              • Catalina logs at least show some memory errors yesterday:
              Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
              @@ -124,7 +124,7 @@ Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]
               

            2018-04-04

              -
            • Peter noticed that there were still some old CRP names on CGSpace, because I hadn't forced the Discovery index to be updated after I fixed the others last week
            • +
            • Peter noticed that there were still some old CRP names on CGSpace, because I hadn’t forced the Discovery index to be updated after I fixed the others last week
            • For completeness I re-ran the CRP corrections on CGSpace:
            $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
            @@ -139,7 +139,7 @@ real    76m13.841s
             user    8m22.960s
             sys     2m2.498s
             
              -
            • Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme's items
            • +
            • Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items
            • I used my add-orcid-identifiers-csv.py script:
            $ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
            @@ -165,13 +165,13 @@ $ git rebase -i dspace-5.8
             
          • DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)
          -
        • … but somehow git knew, and didn't include them in my interactive rebase!
        • +
        • … but somehow git knew, and didn’t include them in my interactive rebase!
        • I need to send this branch to Atmire and also arrange payment (see ticket #560 in their tracker)
        • -
        • Fix Sisay's SSH access to the new DSpace Test server (linode19)
        • +
        • Fix Sisay’s SSH access to the new DSpace Test server (linode19)

        2018-04-05

          -
        • Fix Sisay's sudo access on the new DSpace Test server (linode19)
        • +
        • Fix Sisay’s sudo access on the new DSpace Test server (linode19)
        • The reindexing process on DSpace Test took forever yesterday:
        $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
        @@ -220,15 +220,15 @@ sys     2m52.585s
         
        $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
         4363
         
          -
        • 70.32.83.92 appears to be some harvester we've seen before, but on a new IP
        • +
        • 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP
        • They are not creating new Tomcat sessions so there is no problem there
        • 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:
        $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
         3982
         
          -
        • I'm not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
        • -
        • Let's try a manual request with and without their user agent:
        • +
        • I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
        • +
        • Let’s try a manual request with and without their user agent:
        $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
         GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
        @@ -312,7 +312,7 @@ UPDATE 1
         2115
         
        • Apparently from these stacktraces we should be able to see which code is not closing connections properly
        • -
        • Here's a pretty good overview of days where we had database issues recently:
        • +
        • Here’s a pretty good overview of days where we had database issues recently:
        # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
               1 Feb 18, 2018
        @@ -337,9 +337,9 @@ UPDATE 1
         
      • In Tomcat 8.5 the removeAbandoned property has been split into two: removeAbandonedOnBorrow and removeAbandonedOnMaintenance
      • See: https://tomcat.apache.org/tomcat-8.5-doc/jndi-datasource-examples-howto.html#Database_Connection_Pool_(DBCP_2)_Configurations
      • I assume we want removeAbandonedOnBorrow and make updates to the Tomcat 8 templates in Ansible
      • -
      • After reading more documentation I see that Tomcat 8.5's default DBCP seems to now be Commons DBCP2 instead of Tomcat DBCP
      • -
      • It can be overridden in Tomcat's server.xml by setting factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in the <Resource>
      • -
      • I think we should use this default, so we'll need to remove some other settings that are specific to Tomcat's DBCP like jdbcInterceptors and abandonWhenPercentageFull
      • +
      • After reading more documentation I see that Tomcat 8.5’s default DBCP seems to now be Commons DBCP2 instead of Tomcat DBCP
      • +
      • It can be overridden in Tomcat’s server.xml by setting factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in the <Resource>
      • +
      • I think we should use this default, so we’ll need to remove some other settings that are specific to Tomcat’s DBCP like jdbcInterceptors and abandonWhenPercentageFull
      • Merge the changes adding ORCID identifier to advanced search and Atmire Listings and Reports (#371)
      • Fix one more issue of missing XMLUI strings (for CRP subject when clicking “view more” in the Discovery sidebar)
      • I told Udana to fix the citation and abstract of the one item, and to correct the dc.language.iso for the five Spanish items in his Book Chapters collection
      • @@ -377,7 +377,7 @@ java.lang.NullPointerException
      • I see the same error on DSpace Test so this is definitely a problem
      • After disabling the authority consumer I no longer see the error
      • I merged a pull request to the 5_x-prod branch to clean that up (#372)
      • -
      • File a ticket on DSpace's Jira for the target="_blank" security and performance issue (DS-3891)
      • +
      • File a ticket on DSpace’s Jira for the target="_blank" security and performance issue (DS-3891)
      • I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:
      BUILD SUCCESSFUL
      @@ -394,7 +394,7 @@ Total time: 4 minutes 12 seconds
       
       
      webui.itemlist.sort-option.1 = title:dc.title:title
      @@ -410,15 +410,15 @@ webui.itemlist.sort-option.4 = type:dc.type:text
       
    • For example, set rpp=1 and then check the results for start values of 0, 1, and 2 and they are all the same!
    • If I have time I will check if this behavior persists on DSpace 6.x on the official DSpace demo and file a bug
    • Also, the DSpace Manual as of 5.x has very poor documentation for OpenSearch
    • -
    • They don't tell you to use Discovery search filters in the query (with format query=dateIssued:2018)
    • -
    • They don't tell you that the sort options are actually defined in dspace.cfg (ie, you need to use 2 instead of dc.date.issued_dt)
    • +
    • They don’t tell you to use Discovery search filters in the query (with format query=dateIssued:2018)
    • +
    • They don’t tell you that the sort options are actually defined in dspace.cfg (ie, you need to use 2 instead of dc.date.issued_dt)
    • They are missing the order parameter (ASC vs DESC)
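    • A quick Python sketch of that paging check (the endpoint path and parameter names are assumptions based on the notes above, not verified against the manual):
    #!/usr/bin/env python3
    # Hedged sketch: request one result per page (rpp=1) with different start
    # values and see whether the responses actually differ.
    import requests

    url = 'https://dspacetest.cgiar.org/open-search/discover'  # assumed path
    seen = []
    for start in (0, 1, 2):
        params = {'query': 'dateIssued:2018', 'rpp': 1, 'start': start,
                  'sort_by': 2, 'order': 'DESC'}
        seen.append(requests.get(url, params=params).text)

    # If paging worked, these three single-result pages would differ
    print('all identical:', seen[0] == seen[1] == seen[2])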
    • I notice that DSpace Test has crashed again, due to memory:
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     178
     
      -
    • I will increase the JVM heap size from 5120M to 6144M, though we don't have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
    • +
    • I will increase the JVM heap size from 5120M to 6144M, though we don’t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
    • Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats
    • I got a list of all the CIP collections manually and used the same query that I used in August, 2017:
    @@ -445,8 +445,8 @@ sys 2m2.687s

    2018-04-20

      -
    • Gabriela from CIP emailed to say that CGSpace was returning a white page, but I haven't seen any emails from UptimeRobot
    • -
    • I confirm that it's just giving a white page around 4:16
    • +
    • Gabriela from CIP emailed to say that CGSpace was returning a white page, but I haven’t seen any emails from UptimeRobot
    • +
    • I confirm that it’s just giving a white page around 4:16
    • The DSpace logs show that there are no database connections:
    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
    @@ -456,7 +456,7 @@ sys     2m2.687s
     
    # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
     32147
     
      -
    • I can't even log into PostgreSQL as the postgres user, WTF?
    • +
    • I can’t even log into PostgreSQL as the postgres user, WTF?
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
     ^C
    @@ -475,7 +475,7 @@ sys     2m2.687s
        4325 70.32.83.92
       10718 45.5.184.2
     
      -
    • It doesn't even seem like there is a lot of traffic compared to the previous days:
    • +
    • It doesn’t even seem like there is a lot of traffic compared to the previous days:
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
     74931
    @@ -485,9 +485,9 @@ sys     2m2.687s
     93459
     
    • I tried to restart Tomcat but systemctl hangs
    • -
    • I tried to reboot the server from the command line but after a few minutes it didn't come back up
    • +
    • I tried to reboot the server from the command line but after a few minutes it didn’t come back up
    • Looking at the Linode console I see that it is stuck trying to shut down
    • -
    • Even “Reboot” via Linode console doesn't work!
    • +
    • Even “Reboot” via Linode console doesn’t work!
    • After shutting it down a few times via the Linode console it finally rebooted
    • Everything is back but I have no idea what caused this—I suspect something with the hosting provider
    • Also super weird, the last entry in the DSpace log file is from 2018-04-20 16:35:09, and then immediately it goes to 2018-04-20 19:15:04 (three hours later!):
    • @@ -518,13 +518,13 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time

    2018-04-24

      -
    • Testing my Ansible playbooks with a clean and updated installation of Ubuntu 18.04 and I fixed some issues that I hadn't run into a few weeks ago
    • +
    • Testing my Ansible playbooks with a clean and updated installation of Ubuntu 18.04 and I fixed some issues that I hadn’t run into a few weeks ago
    • There seems to be a new issue with Java dependencies, though
    • The default-jre package is going to be Java 10 on Ubuntu 18.04, but I want to use openjdk-8-jre-headless (well, the JDK actually, but it uses this JRE)
    • Tomcat and Ant are fine with Java 8, but the maven package wants to pull in Java 10 for some reason
    • Looking closer, I see that maven depends on java7-runtime-headless, which is indeed provided by openjdk-8-jre-headless
    • -
    • So it must be one of Maven's dependencies…
    • -
    • I will watch it for a few days because it could be an issue that will be resolved before Ubuntu 18.04's release
    • +
    • So it must be one of Maven’s dependencies…
    • +
    • I will watch it for a few days because it could be an issue that will be resolved before Ubuntu 18.04’s release
    • Otherwise I will post a bug to the ubuntu-release mailing list
    • Looks like the only way to fix this is to install openjdk-8-jdk-headless before (so it pulls in the JRE) in a separate transaction, or to manually install openjdk-8-jre-headless in the same apt transaction as maven
    • Also, I started porting PostgreSQL 9.6 into the Ansible infrastructure scripts
    • @@ -534,12 +534,12 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
      • Still testing the Ansible infrastructure playbooks for Ubuntu 18.04, Tomcat 8.5, and PostgreSQL 9.6
      • One other new thing I notice is that PostgreSQL 9.6 no longer uses createuser and nocreateuser, as those have actually meant superuser and nosuperuser and have been deprecated for ten years
      • -
      • So for my notes, when I'm importing a CGSpace database dump I need to amend my notes to give super user permission to a user, rather than create user:
      • +
      • So for my notes, when I’m importing a CGSpace database dump I need to amend my notes to give super user permission to a user, rather than create user:
      $ psql dspacetest -c 'alter user dspacetest superuser;'
       $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
       
        -
      • There's another issue with Tomcat in Ubuntu 18.04:
      • +
      • There’s another issue with Tomcat in Ubuntu 18.04:
      25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
        java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
      @@ -554,13 +554,13 @@ $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-
               at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
               at java.lang.Thread.run(Thread.java:748)
       

      2018-04-29

      • DSpace Test crashed again, looks like memory issues again
      • -
      • JVM heap size was last increased to 6144m but the system only has 8GB total so there's not much we can do here other than get a bigger Linode instance or remove the massive Solr Statistics data
      • +
      • JVM heap size was last increased to 6144m but the system only has 8GB total so there’s not much we can do here other than get a bigger Linode instance or remove the massive Solr Statistics data

      2018-04-30

        diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
        index 8436ba410..f6f2e7a0f 100644
        --- a/docs/2018-05/index.html
        +++ b/docs/2018-05/index.html
        @@ -35,7 +35,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
         Then I reduced the JVM heap size from 6144 back to 5120m
         Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
        "/>
        @@ -65,7 +65,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
        @@ -112,7 +112,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked

        May, 2018

        @@ -135,7 +135,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
      • Looking over some IITA records for Sisay
        • Other than trimming and collapsing consecutive whitespace, I made some other corrections
        • -
        • I need to check the correct formatting of COTE D'IVOIRE vs COTE D’IVOIRE
        • +
        • I need to check the correct formatting of COTE D'IVOIRE vs COTE D’IVOIRE
        • I replaced all DOIs with HTTPS
        • I checked a few DOIs and found at least one that was missing, so I Googled the title of the paper and found the correct DOI
        • Also, I found an FAQ for DOI that says the dx.doi.org syntax is older, so I will replace all the DOIs with doi.org instead
        • @@ -180,7 +180,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
        $ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
         
          -
        • Most of the links are good, though one is duplicate and one seems to even be incorrect in the publisher's site so…
        • +
        • Most of the links are good, though one is duplicate and one seems to even be incorrect in the publisher’s site so…
        • Also, there are some duplicates:
          • 10568/92241 and 10568/92230 (same DOI)
      @@ -216,8 +216,8 @@ $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combine
      # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
      $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
      -
    • I made a pull request (#373) for this that I'll merge some time next week (I'm expecting Atmire to get back to us about DSpace 5.8 soon)
    • -
    • After testing quickly I just decided to merge it, and I noticed that I don't even need to restart Tomcat for the changes to get loaded
    • +
    • I made a pull request (#373) for this that I’ll merge some time next week (I’m expecting Atmire to get back to us about DSpace 5.8 soon)
    • +
    • After testing quickly I just decided to merge it, and I noticed that I don’t even need to restart Tomcat for the changes to get loaded

    2018-05-07

      @@ -225,7 +225,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
    • The documentation regarding the Solr stuff is limited, and I cannot figure out what all the fields in conciliator.properties are supposed to be
    • But then I found reconcile-csv, which allows you to reconcile against values in a CSV file!
    • That, combined with splitting our multi-value fields on “||” in OpenRefine is amaaaaazing, because after reconciliation you can just join them again
    • -
    • Oh wow, you can also facet on the individual values once you've split them! That's going to be amazing for proofing CRPs, subjects, etc.
    • +
    • Oh wow, you can also facet on the individual values once you’ve split them! That’s going to be amazing for proofing CRPs, subjects, etc.
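    • For illustration, the same split-on-“||”, fix, and rejoin workflow in plain Python (the column name and the example correction are hypothetical):
    #!/usr/bin/env python3
    # Hedged sketch: split a multi-value field on "||", apply corrections to each
    # value, and join them back together, as described above for OpenRefine.
    import csv

    corrections = {'LIVESTOCK AND FISH': 'Livestock and Fish'}  # example mapping

    with open('metadata.csv', newline='') as f:                 # hypothetical file
        for row in csv.DictReader(f):
            values = row['cg.contributor.crp[en_US]'].split('||')  # assumed column
            fixed = [corrections.get(v.strip(), v.strip()) for v in values]
            print('||'.join(fixed))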

    2018-05-09

      @@ -276,7 +276,7 @@ Livestock and Fish
      • It turns out there was a space in my “country” header that was causing reconcile-csv to crash
      • After removing that it works fine!
      • -
      • Looking at Sisay's 2,640 CIFOR records on DSpace Test (10568/92904)
      • +
      • Looking at Sisay’s 2,640 CIFOR records on DSpace Test (10568/92904)
        • Trimmed all leading / trailing white space and condensed multiple spaces into one
        • Corrected DOIs to use HTTPS and “doi.org” instead of “dx.doi.org”
        @@ -318,9 +318,9 @@ return "blank"
        • You could use this in a facet or in a new column
        • More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine
        • Finish looking at the 2,640 CIFOR records on DSpace Test (10568/92904), cleaning up authors and adding collection mappings
        • -
        • They can now be moved to CGSpace as far as I'm concerned, but I don't know if Sisay will do it or me
        • -
        • I was checking the CIFOR data for duplicates using Atmire's Metadata Quality Module (and found some duplicates actually), but then DSpace died…
        • -
        • I didn't see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmest -T:
        • +
        • They can now be moved to CGSpace as far as I’m concerned, but I don’t know if Sisay will do it or me
        • +
        • I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…
        • +
        • I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:
        [Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
         [Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
        @@ -335,7 +335,7 @@ return "blank"
         
      2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
       
        -
      • So I'm not sure…
      • +
      • So I’m not sure…
      • I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:
      • The trick was to use a more appropriate Solr fieldType text_en instead of text_general so that more terms match, for example uppercase and lower case:
      @@ -344,11 +344,11 @@ $ ./bin/solr create_core -c countries
      $ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
      $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
        -
      • It still doesn't catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn't return scores, so I have to select matches manually:
      • +
      • It still doesn’t catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn’t return scores, so I have to select matches manually:

      OpenRefine reconciling countries from local Solr

        -
      • I should probably make a general copy field and set it to be the default search field, like DSpace's search core does (see schema.xml):
      • +
      • I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):
      <defaultSearchField>search_text</defaultSearchField>
       ...
      @@ -356,7 +356,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
       
      • Actually, I wonder how much of their schema I could just copy…
      • Apparently the default search field is the df parameter and you could technically just add it to the query string, so no need to bother with that in the schema now
      • -
      • I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn't seem to be any better at matching than the text_en type
      • +
      • I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn’t seem to be any better at matching than the text_en type
      • I think I need to focus on trying to return scores with conciliator

      2018-05-16

      @@ -364,9 +364,9 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    • Discuss GDPR with James Stapleton
      • As far as I see it, we are “Data Controllers” on CGSpace because we store peoples’ names, emails, and phone numbers if they register
      • -
      • We set cookies on the user's computer, but these do not contain personally identifiable information (PII) and they are “session” cookies which are deleted when the user closes their browser
      • +
      • We set cookies on the user’s computer, but these do not contain personally identifiable information (PII) and they are “session” cookies which are deleted when the user closes their browser
      • We use Google Analytics to track website usage, which makes Google the “Data Processor” and in this case we merely need to limit or obfuscate the information we send to them
      • -
      • As the only personally identifiable information we send is the user's IP address, I think we only need to enable IP Address Anonymization in our analytics.js code snippets
      • +
      • As the only personally identifiable information we send is the user’s IP address, I think we only need to enable IP Address Anonymization in our analytics.js code snippets
      • Then we can add a “Privacy” page to CGSpace that makes all of this clear
    • @@ -380,22 +380,22 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
      • I tested loading a certain page before and after adding this and afterwards I saw that the parameter aip=1 was being sent with the analytics response to Google
      • According to the analytics.js protocol parameter documentation this means that IPs are being anonymized
      • -
      • After finding and fixing some duplicates in IITA's IITA_April_27 test collection on DSpace Test (10568/92703) I told Sisay that he can move them to IITA's Journal Articles collection on CGSpace
      • +
      • After finding and fixing some duplicates in IITA’s IITA_April_27 test collection on DSpace Test (10568/92703) I told Sisay that he can move them to IITA’s Journal Articles collection on CGSpace

      2018-05-17

        -
      • Testing reconciliation of countries against Solr via conciliator, I notice that CÔTE D'IVOIRE doesn't match COTE D'IVOIRE, whereas with reconcile-csv it does
      • -
      • Also, when reconciling regions against Solr via conciliator EASTERN AFRICA doesn't match EAST AFRICA, whereas with reconcile-csv it does
      • +
      • Testing reconciliation of countries against Solr via conciliator, I notice that CÔTE D'IVOIRE doesn’t match COTE D'IVOIRE, whereas with reconcile-csv it does
      • +
      • Also, when reconciling regions against Solr via conciliator EASTERN AFRICA doesn’t match EAST AFRICA, whereas with reconcile-csv it does
      • And SOUTH AMERICA matches both SOUTH ASIA and SOUTH AMERICA with the same match score of 2… WTF.
      • It could be that I just need to tune the query filter in Solr (currently using the example text_en field type)
      • Oh sweet, it turns out that the issue with searching for characters with accents is called “code folding” in Solr
      • You can use either a solr.ASCIIFoldingFilterFactory filter or a solr.MappingCharFilterFactory charFilter mapping against mapping-FoldToASCII.txt
      • Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/
      • Now CÔTE D'IVOIRE matches COTE D'IVOIRE!
      • -
      • I'm not sure which method is better, perhaps the solr.ASCIIFoldingFilterFactory filter because it doesn't require copying the mapping-FoldToASCII.txt file
      • -
      • And actually I'm not entirely sure about the order of filtering before tokenizing, etc…
      • +
      • I’m not sure which method is better, perhaps the solr.ASCIIFoldingFilterFactory filter because it doesn’t require copying the mapping-FoldToASCII.txt file
      • +
      • And actually I’m not entirely sure about the order of filtering before tokenizing, etc…
      • Ah, I see that charFilter must be before the tokenizer because it works on a stream, whereas filter operates on tokenized input so it must come after the tokenizer
      • -
      • Regarding the use of the charFilter vs the filter class before and after the tokenizer, respectively, I think it's better to use the charFilter to normalize the input stream before tokenizing it as I have no idea what kinda stuff might get removed by the tokenizer
      • +
      • Regarding the use of the charFilter vs the filter class before and after the tokenizer, respectively, I think it’s better to use the charFilter to normalize the input stream before tokenizing it as I have no idea what kinda stuff might get removed by the tokenizer
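      • As a rough analogy for what the ASCII-folding filter does, here is a Python illustration using unicodedata to strip accents (this is only an approximation of Solr’s behavior, not its implementation):
      #!/usr/bin/env python3
      # Illustration of "code folding": decompose accented characters and drop the
      # combining marks so that CÔTE D'IVOIRE matches COTE D'IVOIRE.
      import unicodedata

      def fold(value):
          decomposed = unicodedata.normalize('NFKD', value)
          return ''.join(c for c in decomposed if not unicodedata.combining(c))

      print(fold("CÔTE D'IVOIRE") == "COTE D'IVOIRE")  # True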
      • Skype with Geoffrey from IITA in Nairobi, who wants to deposit records to CGSpace via the REST API; I told him that this skips the submission workflows, and because we cannot guarantee the data quality we would not allow anyone to use it this way
      • I finished making the XMLUI changes for anonymization of IP addresses in Google Analytics and merged the changes to the 5_x-prod branch (#375)
      • Also, I think we might be able to implement opt-out functionality for Google Analytics using a window property that could be managed by storing its status in a cookie
      • @@ -430,7 +430,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv

      2018-05-23

        -
      • I'm investigating how many non-CGIAR users we have registered on CGSpace:
      • +
      • I’m investigating how many non-CGIAR users we have registered on CGSpace:
      dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
       
        @@ -443,13 +443,13 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv

        2018-05-28

        • Daniel Haile-Michael sent a message that CGSpace was down (I am currently in Oregon so the time difference is ~10 hours)
        • -
        • I looked in the logs but didn't see anything that would be the cause of the crash
        • +
        • I looked in the logs but didn’t see anything that would be the cause of the crash
        • Atmire finalized the DSpace 5.8 testing and sent a pull request: https://github.com/ilri/DSpace/pull/378
        • They have asked if I can test this and get back to them by June 11th

        2018-05-30

          -
        • Talk to Samantha from Bioversity about something related to Google Analytics, I'm still not sure what they want
        • +
        • Talk to Samantha from Bioversity about something related to Google Analytics, I’m still not sure what they want
        • DSpace Test crashed last night, seems to be related to system memory (not JVM heap)
        • I see this in dmesg:
        @@ -458,7 +458,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
        [Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        • I need to check the Tomcat JVM heap size/usage, command line JVM heap size (for cron jobs), and PostgreSQL memory usage
        • -
        • It might be possible to adjust some things, but eventually we'll need a larger VPS instance
        • +
        • It might be possible to adjust some things, but eventually we’ll need a larger VPS instance
        • For some reason there are no JVM stats in Munin, ugh
        • Run all system updates on DSpace Test and reboot it
        • I generated a list of CIFOR duplicates from the CIFOR_May_9 collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika
        • @@ -467,13 +467,13 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
          $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
           $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
           
            -
          • I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR's collection
          • +
          • I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection
          • A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections
          • -
          • I can use the /communities/{id}/collections endpoint of the REST API but it only takes IDs (not handles) and doesn't seem to descend into sub communities
          • +
          • I can use the /communities/{id}/collections endpoint of the REST API but it only takes IDs (not handles) and doesn’t seem to descend into sub communities
          • Shit, so I need the IDs for the top-level ILRI community and all its sub communities (and their sub communities)
          • There has got to be a better way to do this than going to each community and getting their handles and IDs manually
          • Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: rest-find-collections.py
          • -
          • The output isn't great, but all the handles and IDs are printed in debug mode:
          • +
          • The output isn’t great, but all the handles and IDs are printed in debug mode:
          $ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
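          • For reference, the descent itself is just a recursive walk over the REST API; a minimal sketch of the idea (assuming the DSpace 5 REST endpoints /communities/{id}/communities and /communities/{id}/collections, and that the returned JSON objects expose id, handle, and name) would be:
          import requests

          def walk_collections(base_url, community_id):
              # Yield (handle, name) for every collection under a community, descending into sub communities
              for col in requests.get(f'{base_url}/communities/{community_id}/collections').json():
                  yield col['handle'], col['name']
              for sub in requests.get(f'{base_url}/communities/{community_id}/communities').json():
                  yield from walk_collections(base_url, sub['id'])

          # The numeric ID used here for the top-level ILRI community (10568/1) is hypothetical; look it up first
          for handle, name in walk_collections('https://cgspace.cgiar.org/rest', 2):
              print(handle, name)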
           
            @@ -482,8 +482,8 @@ $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cle
            dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
             

            2018-05-31

              -
            • Clarify CGSpace's usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance
            • -
            • Testing running PostgreSQL in a Docker container on localhost because when I'm on Arch Linux there isn't an easily installable package for particular PostgreSQL versions
            • +
            • Clarify CGSpace’s usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance
            • +
            • Testing running PostgreSQL in a Docker container on localhost because when I’m on Arch Linux there isn’t an easily installable package for particular PostgreSQL versions
            • Now I can just use Docker:
            $ docker pull postgres:9.5-alpine
            diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
            index e094313c1..5a894697f 100644
            --- a/docs/2018-06/index.html
            +++ b/docs/2018-06/index.html
            @@ -10,7 +10,7 @@
             
             Test the DSpace 5.8 module upgrades from Atmire (#378)
             
            -There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
            +There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
             
             
             I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
            @@ -38,7 +38,7 @@ sys     2m7.289s
             
             Test the DSpace 5.8 module upgrades from Atmire (#378)
             
            -There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
            +There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
             
             
             I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
            @@ -55,7 +55,7 @@ real    74m42.646s
             user    8m5.056s
             sys     2m7.289s
             "/>
            -
            +
             
             
                 
            @@ -85,7 +85,7 @@ sys     2m7.289s
                 
                 
                 
            -    
            +    
                 
             
                 
            @@ -132,7 +132,7 @@ sys     2m7.289s
               

            June, 2018

            @@ -141,7 +141,7 @@ sys 2m7.289s
            • Test the DSpace 5.8 module upgrades from Atmire (#378)
                -
              • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
              • +
              • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
            • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
            • @@ -160,8 +160,8 @@ sys 2m7.289s

            2018-06-06

            2018-06-07

              @@ -204,7 +204,7 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015

            2018-06-09

              -
            • It's pretty annoying, but the JVM monitoring for Munin was never set up when I migrated DSpace Test to its new server a few months ago
            • +
            • It’s pretty annoying, but the JVM monitoring for Munin was never set up when I migrated DSpace Test to its new server a few months ago
            • I ran the tomcat and munin-node tags in Ansible again and now the stuff is all wired up and recording stats properly
            • I applied the CIP author corrections on CGSpace and DSpace Test and re-ran the Discovery indexing
            @@ -216,9 +216,9 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
             INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
             Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
             
              -
            • I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I'm not actually sure if that is related to MQM or not
            • +
            • I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I’m not actually sure if that is related to MQM or not
            • I will have to ask Atmire
            • -
            • I continued to look at Sisay's IITA records from last week
            • +
            • I continued to look at Sisay’s IITA records from last week
              • I normalized all DOIs to use HTTPS and “doi.org” instead of “dx.doi.org”
              • I cleaned up white space in cg.subject.iita and dc.subject
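              • The DOI normalization boils down to a single regex; the cleanup was done with OpenRefine transforms, but a rough Python sketch of the same transform would be:
              import re

              def normalize_doi(value):
                  # Force HTTPS and the canonical doi.org hostname
                  return re.sub(r'^https?://(dx\.)?doi\.org/', 'https://doi.org/', value.strip())

              print(normalize_doi('http://dx.doi.org/10.1234/example'))  # https://doi.org/10.1234/example (placeholder DOI)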
              • @@ -254,14 +254,14 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
              • “Institut de la Recherche Agronomique, Cameroon” and “Institut de Recherche Agronomique, Cameroon”
            • -
            • Inconsistency in countries: “COTE D’IVOIRE” and “COTE D'IVOIRE”
            • +
            • Inconsistency in countries: “COTE D’IVOIRE” and “COTE D’IVOIRE”
            • A few DOIs with spaces or invalid characters
            • Inconsistency in IITA subjects, for example “PRODUCTION VEGETALE” and “PRODUCTION VÉGÉTALE” and several others
            • I ran value.unescape('javascript') on the abstract and citation fields because it looks like this data came from a SQL database and some stuff was escaped
            -
          • It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
          • -
          • So I told Sisay to re-create the collection using Abenet's XLS from last week (Mercy1805_AY.xls)
          • +
          • It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede’s original file it doesn’t have all those corrections
          • +
          • So I told Sisay to re-create the collection using Abenet’s XLS from last week (Mercy1805_AY.xls)
          • I was curious to see if I could create a GREL for use with a custom text facet in Open Refine to find cells with two or more consecutive spaces
          • I always use the built-in trim and collapse transformations anyways, but this seems to work to find the offending cells: isNotNull(value.match(/.*?\s{2,}.*?/))
          • I wonder if I should start checking for “smart” quotes like ’ (hex 2019)
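          • The same checks are easy to script outside of OpenRefine too; a small Python sketch (the CSV filename is hypothetical) that flags cells with runs of whitespace or the hex 2019 smart quote:
          import csv
          import re

          SUSPECT = re.compile(r'\s{2,}|\u2019')  # two or more consecutive whitespace characters, or a "smart" right single quote

          with open('metadata-export.csv', newline='', encoding='utf-8') as f:  # hypothetical export file
              for row in csv.DictReader(f):
                  for field, value in row.items():
                      if value and SUSPECT.search(value):
                          print(f"{row.get('id', '?')}: check {field}: {value!r}")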
          • @@ -271,15 +271,15 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
          • Udana from IWMI asked about the OAI base URL for their community on CGSpace
          • I think it should be this: https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814
          • The style sheet obfuscates the data, but if you look at the source it is all there, including information about pagination of results
          • -
          • Regarding Udana's Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I'd check them after that
          • -
          • The latest batch of IITA's 200 records (based on Abenet's version Mercy1805_AY.xls) are now in the IITA_Jan_9_II_Ab collection
          • +
          • Regarding Udana’s Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I’d check them after that
          • +
          • The latest batch of IITA’s 200 records (based on Abenet’s version Mercy1805_AY.xls) are now in the IITA_Jan_9_II_Ab collection
          • So here are some corrections:
            • use of Unicode smart quote (hex 2019) in countries and affiliations, for example “COTE D’IVOIRE” and “Institut d’Economic Rurale, Mali”
            • inconsistencies in cg.contributor.affiliation:
              • “Centro Internacional de Agricultura Tropical” and “Centro International de Agricultura Tropical” should use the English name of CIAT (International Center for Tropical Agriculture)
              • -
              • “Institut International d'Agriculture Tropicale” should use the English name of IITA (International Institute of Tropical Agriculture)
              • +
              • “Institut International d’Agriculture Tropicale” should use the English name of IITA (International Institute of Tropical Agriculture)
              • “East and Southern Africa Regional Center” and “Eastern and Southern Africa Regional Centre”
              • “Institut de la Recherche Agronomique, Cameroon” and “Institut de Recherche Agronomique, Cameroon”
              • “Institut des Recherches Agricoles du Bénin” and “Institut National des Recherche Agricoles du Benin” and “National Agricultural Research Institute, Benin”
              • @@ -320,7 +320,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
              • “MATÉRIEL DE PLANTATION” and “MATÉRIELS DE PLANTATION”
            • -
            • I noticed that some records do have encoding errors in the dc.description.abstract field, but only four of them so probably not from Abenet's handling of the XLS file
            • +
            • I noticed that some records do have encoding errors in the dc.description.abstract field, but only four of them so probably not from Abenet’s handling of the XLS file
            • Based on manually eyeballing the text I used a custom text facet with this GREL to identify the records:
          • @@ -344,7 +344,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service

          2018-06-13

            -
          • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara's items
          • +
          • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara’s items
          • I used my add-orcid-identifiers-csv.py script:
          $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
          @@ -355,7 +355,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
           "Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
           "Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
           
            -
          • On a hunch I checked to see if CGSpace's bitstream cleanup was working properly and of course it's broken:
          • +
          • On a hunch I checked to see if CGSpace’s bitstream cleanup was working properly and of course it’s broken:
          $ dspace cleanup -v
           ...
          @@ -368,7 +368,7 @@ Error: ERROR: update or delete on table "bitstream" violates foreign k
           UPDATE 1
           

          2018-06-14

            -
          • Check through Udana's IWMI records from last week on DSpace Test
          • +
          • Check through Udana’s IWMI records from last week on DSpace Test
          • There were only some minor whitespace and one or two syntax errors, but they look very good otherwise
          • I uploaded the twenty-four reports to the IWMI Reports collection: https://cgspace.cgiar.org/handle/10568/36188
          • I uploaded the seventy-six book chapters to the IWMI Book Chapters collection: https://cgspace.cgiar.org/handle/10568/36178
          • @@ -384,22 +384,22 @@ $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h loca
            $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
            • The -O option to pg_restore makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore
            • -
            • I always prefer to use the postgres user locally because it's just easier than remembering the dspacetest user's password, but then I couldn't figure out why the resulting schema was owned by postgres
            • +
            • I always prefer to use the postgres user locally because it’s just easier than remembering the dspacetest user’s password, but then I couldn’t figure out why the resulting schema was owned by postgres
            • So with this you connect as the postgres superuser and then switch roles to dspacetest (also, make sure this user has superuser privileges before the restore)
            • Last week Linode emailed me to say that our Linode 8192 instance used for DSpace Test qualified for an upgrade
            • Apparently they announced some upgrades to most of their plans in 2018-05
            • -
            • After the upgrade I see we have more disk space available in the instance's dashboard, so I shut the instance down and resized it from 98GB to 160GB
            • +
            • After the upgrade I see we have more disk space available in the instance’s dashboard, so I shut the instance down and resized it from 98GB to 160GB
            • The resize was very quick (less than one minute) and after booting the instance back up I now have 160GB for the root filesystem!
            • -
            • I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage, as it was actually kinda slow when we put Solr there and now we don't actually need it anymore because running the production Solr on this instance didn't work well with 8GB of RAM
            • -
            • Also, the larger instance we're using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB… that means we don't need to consider using block storage right now!
            • -
            • The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don't need to bother with upgrading them
            • +
            • I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage, as it was actually kinda slow when we put Solr there and now we don’t actually need it anymore because running the production Solr on this instance didn’t work well with 8GB of RAM
            • +
            • Also, the larger instance we’re using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB… that means we don’t need to consider using block storage right now!
            • +
            • The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don’t need to bother with upgrading them
            • Last week Abenet asked if we could add dc.language.iso to the advanced search filters
            • -
            • There is already a search filter for this field defined in discovery.xml but we aren't using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)
            • +
            • There is already a search filter for this field defined in discovery.xml but we aren’t using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)
            • Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:
            Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
             
              -
            • It took me a while to figure out that this migration is for MQM, which I removed after Atmire's original advice about the migrations, so we actually need to delete this migration instead of updating it
            • +
            • It took me a while to figure out that this migration is for MQM, which I removed after Atmire’s original advice about the migrations, so we actually need to delete this migration instead of updating it
            • So I need to make sure to run the following during the DSpace 5.8 upgrade:
            -- Delete existing CUA 4 migration if it exists
            @@ -430,20 +430,20 @@ Done.
             "Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
             

            2018-06-26

              -
            • Atmire got back to me to say that we can remove the itemCollectionPlugin and HasBitstreamsSSIPlugin beans from DSpace's discovery.xml file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore
            • +
            • Atmire got back to me to say that we can remove the itemCollectionPlugin and HasBitstreamsSSIPlugin beans from DSpace’s discovery.xml file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore
            • I removed both those beans and did some simple tests to check item submission, media-filter of PDFs, REST API, but got an error “No matches for the query” when listing records in OAI
            • This warning appears in the DSpace log:
            2018-06-26 16:58:12,052 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
             
              -
            • It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
            • +
            • It’s actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
            • Ah, I think I just need to run dspace oai import

            2018-06-27

            • Vika from CIFOR sent back his annotations on the duplicates for the “CIFOR_May_9” archive import that I sent him last week
            • -
            • I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
            • -
            • First, get the 62 deletes from Vika's file and remove them from the collection:
            • +
            • I’ll have to figure out how to separate those we’re keeping, deleting, and mapping into CIFOR’s archive collection
            • +
            • First, get the 62 deletes from Vika’s file and remove them from the collection:
            $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
             $ wc -l cifor-handle-to-delete.txt
            @@ -470,7 +470,7 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
             
          • Then I can use Open Refine to add the “CIFOR Archive” collection to the mappings
          • Importing the 2398 items via dspace metadata-import ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
          • After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch
          • -
          • I'll let Abenet take one last look and then move them to CGSpace
          • +
          • I’ll let Abenet take one last look and then move them to CGSpace

          2018-06-28

            @@ -481,9 +481,9 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
            [Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
            [Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
              -
            • Look over IITA's IITA_Jan_9_II_Ab collection from earlier this month on DSpace Test
            • +
            • Look over IITA’s IITA_Jan_9_II_Ab collection from earlier this month on DSpace Test
            • Bosede fixed a few things (and seems to have removed many French IITA subjects like AMÉLIORATION DES PLANTES and SANTÉ DES PLANTES)
            • -
            • I still see at least one issue with author affiliations, and I didn't bother to check the AGROVOC subjects because it's such a mess anyways
            • +
            • I still see at least one issue with author affiliations, and I didn’t bother to check the AGROVOC subjects because it’s such a mess anyways
            • I suggested that IITA provide an updated list of subjects to us so we can include their controlled vocabulary in CGSpace, which would also make it easier to do automated validation
            diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html
            index 89bd5ecd8..5dda54087 100644
            --- a/docs/2018-07/index.html
            +++ b/docs/2018-07/index.html
            @@ -33,7 +33,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
            There is insufficient memory for the Java Runtime Environment to continue.
            "/>
            -
            +
            @@ -63,7 +63,7 @@ There is insufficient memory for the Java Runtime Environment to continue.
            -
            +
            @@ -110,7 +110,7 @@ There is insufficient memory for the Java Runtime Environment to continue.

            July, 2018

            @@ -217,7 +217,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana

            2018-07-04

            • I verified that the autowire error indeed only occurs on Tomcat 8.5, but the application works fine on Tomcat 7
            • -
            • I have raised this in the DSpace 5.8 compatibility ticket on Atmire's tracker
            • +
            • I have raised this in the DSpace 5.8 compatibility ticket on Atmire’s tracker
            • Abenet wants me to add “United Kingdom government” to the sponsors on CGSpace so I created a ticket to track it (#381)
            • Also, Udana wants me to add “Enhancing Sustainability Across Agricultural Systems” to the WLE Phase II research themes so I created a ticket to track that (#382)
            • I need to try to finish this DSpace 5.8 business first because I have too many branches with cherry-picks going on right now!
            • @@ -225,13 +225,13 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana

              2018-07-06

              • CCAFS want me to add “PII-FP2_MSCCCAFS” to their Phase II project tags on CGSpace (#383)
              • -
              • I'll do it in a batch with all the other metadata updates next week
              • +
              • I’ll do it in a batch with all the other metadata updates next week

              2018-07-08

                -
              • I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn't being backed up to S3
              • -
              • I apparently noticed this—and fixed it!—in 2016-07, but it doesn't look like the backup has been updated since then!
              • -
              • It looks like I added Solr to the backup_to_s3.sh script, but that script is not even being used (s3cmd is run directly from root's crontab)
              • +
              • I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn’t being backed up to S3
              • +
              • I apparently noticed this—and fixed it!—in 2016-07, but it doesn’t look like the backup has been updated since then!
              • +
              • It looks like I added Solr to the backup_to_s3.sh script, but that script is not even being used (s3cmd is run directly from root’s crontab)
              • For now I have just initiated a manual S3 backup of the Solr data:
              # s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
              @@ -245,16 +245,16 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
               
              $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
               $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
               
                -
              • But after comparing to the existing list of names I didn't see much change, so I just ignored it
              • +
              • But after comparing to the existing list of names I didn’t see much change, so I just ignored it

              2018-07-09

                -
              • Uptime Robot said that CGSpace was down for two minutes early this morning but I don't see anything in Tomcat logs or dmesg
              • -
              • Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat's catalina.out:
              • +
              • Uptime Robot said that CGSpace was down for two minutes early this morning but I don’t see anything in Tomcat logs or dmesg
              • +
              • Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s catalina.out:
              Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
               
                -
              • I'm not sure if it's the same error, but I see this in DSpace's solr.log:
              • +
              • I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:
              2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
               
                @@ -284,17 +284,17 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
                $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
                 4435
                 
                  -
                • 95.108.181.88 appears to be Yandex, so I dunno why it's creating so many sessions, as its user agent should match Tomcat's Crawler Session Manager Valve
                • -
                • 70.32.83.92 is on MediaTemple but I'm not sure who it is. They are mostly hitting REST so I guess that's fine
                • -
                • 35.227.26.162 doesn't declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx
                • +
                • 95.108.181.88 appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match Tomcat’s Crawler Session Manager Valve
                • +
                • 70.32.83.92 is on MediaTemple but I’m not sure who it is. They are mostly hitting REST so I guess that’s fine
                • +
                • 35.227.26.162 doesn’t declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx
                • 178.154.200.38 is Yandex again
                • 207.46.13.47 is Bing
                • 157.55.39.234 is Bing
                • 137.108.70.6 is our old friend CORE bot
                • -
                • 50.116.102.77 doesn't declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that's fine
                • +
                • 50.116.102.77 doesn’t declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that’s fine
                • 40.77.167.84 is Bing again
                • Interestingly, the first time that I see 35.227.26.162 was on 2018-06-08
                • -
                • I've added 35.227.26.162 to the bot tagging logic in the nginx vhost
                • +
                • I’ve added 35.227.26.162 to the bot tagging logic in the nginx vhost

                2018-07-10

                  @@ -303,7 +303,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
                • Add “PII-FP2_MSCCCAFS” to CCAFS Phase II Project Tags (#383)
                • Add journal title (dc.source) to Discovery search filters (#384)
                • All were tested and merged to the 5_x-prod branch and will be deployed on CGSpace this coming weekend when I do the Linode server upgrade
                • -
                • I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire's 5.8 pull request (#378)
                • +
                • I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire’s 5.8 pull request (#378)
                • Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC
                • These are the top ten users in the last two hours:
                @@ -324,7 +324,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
                213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
                 
                • He said there was a bug that caused his app to request a bunch of invalid URLs
                • -
                • I'll have to keep an eye on this and see how their platform evolves
                • +
                • I’ll have to keep an eye on this and see how their platform evolves

                2018-07-11

                  @@ -365,9 +365,9 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
                    96 40.77.167.90
                  7075 208.110.72.10
                -
              • We have never seen 208.110.72.10 before… so that's interesting!
              • +
              • We have never seen 208.110.72.10 before… so that’s interesting!
              • The user agent for these requests is: Pcore-HTTP/v0.44.0
              • -
              • A brief Google search doesn't turn up any information about what this bot is, but lots of users complaining about it
              • +
              • A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it
              • This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:
              # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
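              • The same count as the zcat/awk/sort/uniq pipeline above, as a rough Python sketch (single, uncompressed log path is hypothetical):
              from collections import Counter

              counts = Counter()
              with open('/var/log/nginx/access.log', encoding='utf-8', errors='replace') as f:  # hypothetical log path
                  for line in f:
                      if 'Pcore-HTTP' in line:
                          # The client IP is the first whitespace-delimited field in the combined log format
                          counts[line.split(' ', 1)[0]] += 1

              for ip, n in counts.most_common(10):
                  print(f'{n:>7} {ip}')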
              @@ -387,7 +387,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
               208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
               
              • So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting
              • -
              • I'll also add it to Tomcat's Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case
              • +
              • I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case
              • Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):
              dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
              @@ -406,7 +406,7 @@ COPY 4518
               

              2018-07-15

              • Run all system updates on CGSpace, add latest metadata changes from last week, and start the Linode instance upgrade
              • -
              • After the upgrade I see we have more disk space available in the instance's dashboard, so I shut the instance down and resized it from 392GB to 650GB
              • +
              • After the upgrade I see we have more disk space available in the instance’s dashboard, so I shut the instance down and resized it from 392GB to 650GB
              • The resize was very quick (less than one minute) and after booting the instance back up I now have 631GB for the root filesystem (with 267GB available)!
              • Peter had asked a question about how mapped items are displayed in the Altmetric dashboard
              • For example, 10568/82810 is mapped to four collections, but only shows up in one “department” in their dashboard
              • @@ -452,9 +452,9 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
                • ICARDA sent me another refined list of ORCID iDs so I sorted and formatted them into our controlled vocabulary again
                • Participate in call with IWMI and WLE to discuss Altmetric, CGSpace, and social media
                • -
                • I told them that they should try to include the Handle link on their social media shares because that's the only way to get Altmetric to notice them and associate them with their DOIs
                • +
                • I told them that they should try to include the Handle link on their social media shares because that’s the only way to get Altmetric to notice them and associate them with their DOIs
                • I suggested that we should have a wider meeting about this, and that I would post that on Yammer
                • -
                • I was curious about how and when Altmetric harvests the OAI, so I looked in nginx's OAI log
                • +
                • I was curious about how and when Altmetric harvests the OAI, so I looked in nginx’s OAI log
                • For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1,500 requests
                • In there I see two bots making about 750 requests each, and this one is probably Altmetric:
                @@ -494,7 +494,7 @@ X-XSS-Protection: 1; mode=block
              • Post a note on Yammer about Altmetric and Handle best practices
              • Update PostgreSQL JDBC jar from 42.2.2 to 42.2.4 in the RMG Ansible playbooks
              • IWMI asked why all the dates in their OpenSearch RSS feed show up as January 01, 2018
              • -
              • On closer inspection I notice that many of their items use “2018” as their dc.date.issued, which is a valid ISO 8601 date but it's not very specific so DSpace assumes it is January 01, 2018 00:00:00…
              • +
              • On closer inspection I notice that many of their items use “2018” as their dc.date.issued, which is a valid ISO 8601 date but it’s not very specific so DSpace assumes it is January 01, 2018 00:00:00…
              • I told her that they need to start using more accurate dates for their issue dates
              • In the example item I looked at the DOI has a publish date of 2018-03-16, so they should really try to capture that
              @@ -507,8 +507,8 @@ X-XSS-Protection: 1; mode=block
              webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
               
              • Just because I was curious I made sure that these options are working as expected in DSpace 5.8 on DSpace Test (they are)
              • -
              • I tested the Atmire Listings and Reports (L&R) module one last time on my local test environment with a new snapshot of CGSpace's database and re-generated Discovery index and it worked fine
              • -
              • I finally informed Atmire that we're ready to proceed with deploying this to CGSpace and that they should advise whether we should wait about the SNAPSHOT versions in pom.xml
              • +
              • I tested the Atmire Listings and Reports (L&R) module one last time on my local test environment with a new snapshot of CGSpace’s database and re-generated Discovery index and it worked fine
              • +
              • I finally informed Atmire that we’re ready to proceed with deploying this to CGSpace and that they should advise whether we should wait about the SNAPSHOT versions in pom.xml
              • There is no word on the issue I reported with Tomcat 8.5.32 yet, though…

              2018-07-23

              @@ -539,7 +539,7 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an

            2018-07-27

              -
            • Follow up with Atmire again about the SNAPSHOT versions in our pom.xml because I want to finalize the DSpace 5.8 upgrade soon and I haven't heard from them in a month (ticket 560)
            • +
            • Follow up with Atmire again about the SNAPSHOT versions in our pom.xml because I want to finalize the DSpace 5.8 upgrade soon and I haven’t heard from them in a month (ticket 560)
            diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html
            index 08cec5d29..4d3f9cc92 100644
            --- a/docs/2018-08/index.html
            +++ b/docs/2018-08/index.html
            @@ -15,10 +15,10 @@ DSpace Test had crashed at some point yesterday morning and I see the following
            [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
            Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
            -From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
            -I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
            +From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
            +I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
            Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
            -The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
            +The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
            I ran all system updates on DSpace Test and rebooted it
            " />
            @@ -37,13 +37,13 @@ DSpace Test had crashed at some point yesterday morning and I see the following
            [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
            Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
            -From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
            -I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
            +From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
            +I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
            Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
            -The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
            +The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
            I ran all system updates on DSpace Test and rebooted it
            "/>
            -
            +
            @@ -73,7 +73,7 @@ I ran all system updates on DSpace Test and rebooted it
            -
            +
            @@ -120,7 +120,7 @@ I ran all system updates on DSpace Test and rebooted it

            August, 2018

            @@ -134,10 +134,10 @@ I ran all system updates on DSpace Test and rebooted it
            [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
            • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
            • -
            • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
            • -
            • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
            • +
            • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
            • +
            • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
            • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
            • -
            • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
            • +
            • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
            • I ran all system updates on DSpace Test and rebooted it
              @@ -152,23 +152,23 @@ I ran all system updates on DSpace Test and rebooted it

            2018-08-02

              -
            • DSpace Test crashed again and the only error I see is this in dmesg:
            • +
            • DSpace Test crashed again and the only error I see is this in dmesg:
            [Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
             [Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
             
            • I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
            • -
            • The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
            • +
            • The risk we run there is that we’ll start getting OutOfMemory errors from Tomcat
            • So basically we need a new test server with more RAM very soon…
            • Abenet asked about the workflow statistics in the Atmire CUA module again
            • -
            • Last year Atmire told me that it's disabled by default but you can enable it with workflow.stats.enabled = true in the CUA configuration file
            • -
            • There was a bug with adding users so they sent a patch, but I didn't merge it because it was very dirty and I wasn't sure it actually fixed the problem
            • -
            • I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”
            • +
            • Last year Atmire told me that it’s disabled by default but you can enable it with workflow.stats.enabled = true in the CUA configuration file
            • +
            • There was a bug with adding users so they sent a patch, but I didn’t merge it because it was very dirty and I wasn’t sure it actually fixed the problem
            • +
            • I just tried to enable the stats again on DSpace Test now that we’re on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”
            • As a test I submitted a new item and I was able to see it in the workflow statistics “data” tab, but not in the graph

            2018-08-15

              -
            • Run through Peter's list of author affiliations from earlier this month
            • +
            • Run through Peter’s list of author affiliations from earlier this month
            • I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
            • Finally I did a test run with the fix-metadata-value.py script:
            @@ -210,8 +210,8 @@ Verchot, L.V.
            Verchot, LV
            Verchot, Louis V.
              -
            • I'll just tag them all with Louis Verchot's ORCID identifier…
            • -
            • In the end, I'll run the following CSV with my add-orcid-identifiers-csv.py script:
            • +
            • I’ll just tag them all with Louis Verchot’s ORCID identifier…
            • +
            • In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:
            dc.contributor.author,cg.creator.id
             "Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
            @@ -290,17 +290,17 @@ sys     2m20.248s
             # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
             1724
             
              -
            • I don't even know how it's possible for the bot to use MORE sessions than total requests…
            • +
            • I don’t even know how it’s possible for the bot to use MORE sessions than total requests…
            • The user agent is:
            Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
             
              -
            • So I'm thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.
            • +
            • So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.

            2018-08-20

            • Help Sisay with some UTF-8 encoding issues in a file Peter sent him
            • -
            • Finish up reconciling Atmire's pull request for DSpace 5.8 changes with the latest status of our 5_x-prod branch
            • +
            • Finish up reconciling Atmire’s pull request for DSpace 5.8 changes with the latest status of our 5_x-prod branch
            • I had to do some git rev-list --reverse --no-merges oldestcommit..newestcommit and git cherry-pick -S hackery to get everything all in order
            • After building I ran the Atmire schema migrations and forced old migrations, then did the ant update
            • I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set JAVA_OPTS to 1024m and tried the mvn package again
            • @@ -308,8 +308,8 @@ sys 2m20.248s
            • I will try to reduce Tomcat memory from 4608m to 4096m and then retry the mvn package with 1024m of JAVA_OPTS again
            • After running the mvn package for the third time and waiting an hour, I attached strace to the Java process and saw that it was indeed reading XMLUI theme data… so I guess I just need to wait more
            • After waiting two hours the maven process completed and installation was successful
            • -
            • I restarted Tomcat and it seems everything is working well, so I'll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
            • -
            • I merged Atmire's pull request into our 5_x-dspace-5.8 temporary branch and then cherry-picked all the changes from 5_x-prod since April, 2018 when that temporary branch was created
            • +
            • I restarted Tomcat and it seems everything is working well, so I’ll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
            • +
            • I merged Atmire’s pull request into our 5_x-dspace-5.8 temporary branch and then cherry-picked all the changes from 5_x-prod since April, 2018 when that temporary branch was created
            • As the branch histories are very different I cannot merge the new 5.8 branch into the current 5_x-prod branch
            • Instead, I will archive the current 5_x-prod DSpace 5.5 branch as 5_x-prod-dspace-5.5 and then hard reset 5_x-prod based on 5_x-dspace-5.8
            • Unfortunately this will mess up the references in pull requests and issues on GitHub
            • @@ -320,8 +320,8 @@ sys 2m20.248s
            [INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
             
              -
            • It's the same on DSpace Test, my local laptop, and CGSpace…
            • -
            • It wasn't this way before when I was constantly building the previous 5.8 branch with Atmire patches…
            • +
            • It’s the same on DSpace Test, my local laptop, and CGSpace…
            • +
            • It wasn’t this way before when I was constantly building the previous 5.8 branch with Atmire patches…
            • I will restore the previous 5_x-dspace-5.8 and atmire-module-upgrades-5.8 branches to see if the build time is different there
            • … it seems that the atmire-module-upgrades-5.8 branch still takes 1 hour and 23 minutes on my local machine…
            • Let me try to build the old 5_x-prod-dspace-5.5 branch on my local machine and see how long it takes
            • @@ -330,7 +330,7 @@ sys 2m20.248s
            [INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
             
              -
            • And I notice that Atmire changed something in the XMLUI module's pom.xml as part of the DSpace 5.8 changes, specifically to remove the exclude for node_modules in the maven-war-plugin step
            • +
            • And I notice that Atmire changed something in the XMLUI module’s pom.xml as part of the DSpace 5.8 changes, specifically to remove the exclude for node_modules in the maven-war-plugin step
            • This exclude is present in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
            • It makes sense that it would take longer to complete this step because the node_modules folder has tens of thousands of files, and we have 27 themes!
            • I need to test to see if this has any side effects when deployed…
            • @@ -342,14 +342,14 @@ sys 2m20.248s
            • They say they want to start working on the ContentDM harvester middleware again
            • I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace
            • Discuss CTA items with Sisay, he was trying to figure out how to do the collection mapping in combination with SAFBuilder
            • -
            • It appears that the web UI's upload interface requires you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the collections file inside each item in the bundle
            • +
            • It appears that the web UI’s upload interface requires you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the collections file inside each item in the bundle
            • I imported the CTA items on CGSpace for Sisay:
            $ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
             

            2018-08-26

            • Doing the DSpace 5.8 upgrade on CGSpace (linode18)
            • -
            • I already finished the Maven build, now I'll take a backup of the PostgreSQL database and do a database cleanup just in case:
            • +
            • I already finished the Maven build, now I’ll take a backup of the PostgreSQL database and do a database cleanup just in case:
            $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
             $ dspace cleanup -v
            @@ -371,7 +371,7 @@ dspace=> \q
             
             $ dspace database migrate ignored
             
              -
            • Then I'll run all system updates and reboot the server:
            • +
            • Then I’ll run all system updates and reboot the server:
            $ sudo su -
             # apt update && apt full-upgrade
            @@ -380,9 +380,9 @@ $ dspace database migrate ignored
             
            • After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine
            • Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now
            • -
            • They want a CSV with all metadata, which the Atmire Listings and Reports module can't do
            • +
            • They want a CSV with all metadata, which the Atmire Listings and Reports module can’t do
            • I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject GENDER or GENDER POVERTY AND INSTITUTIONS, and CRP Water, Land and Ecosystems
            • -
            • Then I extracted the Handle links from the report so I could export each item's metadata as CSV
            • +
            • Then I extracted the Handle links from the report so I could export each item’s metadata as CSV
            $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
             
              @@ -391,21 +391,21 @@ $ dspace database migrate ignored
              $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
               
              • But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
              • -
              • I'm not sure how to proceed without writing some script to parse and join the CSVs, and I don't think it's worth my time
              • -
              • I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I'm not sure why I got those errors last time I tried
              • +
              • I’m not sure how to proceed without writing some script to parse and join the CSVs, and I don’t think it’s worth my time
              • +
              • I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I’m not sure why I got those errors last time I tried
              • It could have been a configuration issue, though, as I also reconciled the server.xml with the one in our Ansible infrastructure scripts
              • But now I can start testing and preparing to move DSpace Test to Ubuntu 18.04 + Tomcat 8.5 + OpenJDK + PostgreSQL 9.6…
              • Actually, upon closer inspection, it seems that when you try to go to Listings and Reports under Tomcat 8.5.33 you are taken to the JSPUI login page despite having already logged in in XMLUI
              • If I type my username and password again it does take me to Listings and Reports, though…
              • -
              • I don't see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire
              • -
              • For what it's worth, the Content and Usage (CUA) module does load, though I can't seem to get any results in the graph
              • +
              • I don’t see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire
              • +
              • For what it’s worth, the Content and Usage (CUA) module does load, though I can’t seem to get any results in the graph
• I just checked to see if the Listings and Reports issue with using the CGSpace citation field was fixed as planned alongside the DSpace 5.8 upgrades (#589)
              • I was able to create a new layout containing only the citation field, so I closed the ticket

              2018-08-29

              • Discuss COPO with Martin Mueller
              • -
• His and the consortium's idea is to use this for metadata annotation (submission?) to all repositories
              • +
• His and the consortium’s idea is to use this for metadata annotation (submission?) to all repositories
              • It is somehow related to adding events as items in the repository, and then linking related papers, presentations, etc to the event item using dc.relation, etc.
              • Discuss Linode server charges with Abenet, apparently we want to start charging these to Big Data
              diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html index b34a047e3..6c148a8bb 100644 --- a/docs/2018-09/index.html +++ b/docs/2018-09/index.html @@ -9,9 +9,9 @@ @@ -23,11 +23,11 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&# - + @@ -57,7 +57,7 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&# - + @@ -104,7 +104,7 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&#

              September, 2018

              @@ -112,9 +112,9 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&#

              2018-09-02

              • New PostgreSQL JDBC driver version 42.2.5
              • -
              • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
              • -
              • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
              • -
              • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
              • +
              • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
              • +
              • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
              • +
              • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
              02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
                java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
              @@ -138,11 +138,11 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
               
            • XMLUI fails to load, but the REST, SOLR, JSPUI, etc work
            • The old 5_x-prod-dspace-5.5 branch does work in Ubuntu 18.04 with Tomcat 8.5.30-1ubuntu1.4, however!
            • And the 5_x-prod DSpace 5.8 branch does work in Tomcat 8.5.x on my Arch Linux laptop…
            • -
            • I'm not sure where the issue is then!
            • +
            • I’m not sure where the issue is then!

            2018-09-03

              -
            • Abenet says she's getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week
            • +
            • Abenet says she’s getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week
            • They are from the CUA module
            • Two of them have “no data” and one has a “null” title
            • The last one is a report of the top downloaded items, and includes a graph
            • @@ -151,13 +151,13 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana

            2018-09-04

              -
• I'm looking over the latest round of IITA records from Sisay: Mercy1806_August_29
• +
            • I’m looking over the latest round of IITA records from Sisay: Mercy1806_August_29
              • All fields are split with multiple columns like cg.authorship.types and cg.authorship.types[]
              • This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)
              • Five items had dc.date.issued values like 2013-5 so I corrected them to be 2013-05
              • Several metadata fields had values with newlines in them (even in some titles!), which I fixed by trimming the consecutive whitespaces in Open Refine
              • -
• Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn't exist then so this is impossible
• +
              • Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn’t exist then so this is impossible
                • I got all items that were from 2011 and onwards using a custom facet with this GREL on the dc.date.issued column: isNotNull(value.match(/201[1-8].*/)) and then blanking their CRPs
                @@ -170,7 +170,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
              • One invalid value for dc.type
            • -
• Abenet says she hasn't received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don't need to create an issue on Atmire's bug tracker anymore
• +
• Abenet says she hasn’t received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don’t need to create an issue on Atmire’s bug tracker anymore

            2018-09-10

              @@ -213,7 +213,7 @@ requests:
              2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step: 
               org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
               
                -
              • Seems to be during submit step, because it's workflow step 1…?
              • +
              • Seems to be during submit step, because it’s workflow step 1…?
              • Move some top-level CRP communities to be below the new CGIAR Research Programs and Platforms community:
              $ dspace community-filiator --set -p 10568/97114 -c 10568/51670
              @@ -237,7 +237,7 @@ UPDATE 15
               
            • The current cg.identifier.status field will become “Access rights” and dc.rights will become “Usage rights”
            • I have some work in progress on the 5_x-rights branch
            • Linode said that CGSpace (linode18) had a high CPU load earlier today
            • -
            • When I looked, I see it's the same Russian IP that I noticed last month:
            • +
            • When I looked, I see it’s the same Russian IP that I noticed last month:
            # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                1459 157.55.39.202
            @@ -260,7 +260,7 @@ UPDATE 15
             
          Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
           
            -
          • I added .*crawl.* to the Tomcat Session Crawler Manager Valve, so I'm not sure why the bot is creating so many sessions…
          • +
          • I added .*crawl.* to the Tomcat Session Crawler Manager Valve, so I’m not sure why the bot is creating so many sessions…
          • I just tested that user agent on CGSpace and it does not create a new session:
          $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
          @@ -298,17 +298,17 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
           
          • Sisay is still having problems with the controlled vocabulary for top authors
          • I took a look at the submission template and Firefox complains that the XML file is missing a root element
          • -
          • I guess it's because Firefox is receiving an empty XML file
          • +
          • I guess it’s because Firefox is receiving an empty XML file
          • I told Sisay to run the XML file through tidy
          • More testing of the access and usage rights changes
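• Following up on the tidy suggestion above: the same style of tidy invocation that appears later in these notes cleans such a file up in place (the vocabulary file path here is only an example):
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml # example path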

          2018-09-13

          • Peter was communicating with Altmetric about the OAI mapping issue for item 10568/82810 again
          • -
          • Altmetric said it was somehow related to the OAI dateStamp not getting updated when the mappings changed, but I said that back in 2018-07 when this happened it was because the OAI was actually just not reflecting all the item's mappings
          • +
          • Altmetric said it was somehow related to the OAI dateStamp not getting updated when the mappings changed, but I said that back in 2018-07 when this happened it was because the OAI was actually just not reflecting all the item’s mappings
          • After forcing a complete re-indexing of OAI the mappings were fine
          • -
          • The dateStamp is most probably only updated when the item's metadata changes, not its mappings, so if Altmetric is relying on that we're in a tricky spot
          • -
          • We need to make sure that our OAI isn't publicizing stale data… I was going to post something on the dspace-tech mailing list, but never did
          • +
          • The dateStamp is most probably only updated when the item’s metadata changes, not its mappings, so if Altmetric is relying on that we’re in a tricky spot
          • +
          • We need to make sure that our OAI isn’t publicizing stale data… I was going to post something on the dspace-tech mailing list, but never did
          • Linode says that CGSpace (linode18) has had high CPU for the past two hours
          • The top IP addresses today are:
@@ -331,8 +331,8 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2
            -
          • So I'm not sure what's going on
          • -
          • Valerio asked me if there's a way to get the page views and downloads from CGSpace
          • +
          • So I’m not sure what’s going on
          • +
          • Valerio asked me if there’s a way to get the page views and downloads from CGSpace
          • I said no, but that we might be able to piggyback on the Atmire statlet REST API
          • For example, when you expand the “statlet” at the bottom of an item like 10568/97103 you can see the following request in the browser console:
          @@ -340,12 +340,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
          • That JSON file has the total page views and item downloads for the item…
          • Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds
          • -
• I had a quick look at the DSpace 5.x manual and it doesn't seem that this is possible (you can only add metadata)
          • -
          • Testing the new LDAP server the CGNET says will be replacing the old one, it doesn't seem that they are using the global catalog on port 3269 anymore, now only 636 is open
          • +
• I had a quick look at the DSpace 5.x manual and it doesn’t seem that this is possible (you can only add metadata)
          • +
          • Testing the new LDAP server the CGNET says will be replacing the old one, it doesn’t seem that they are using the global catalog on port 3269 anymore, now only 636 is open
          • I did a clean deploy of DSpace 5.8 on Ubuntu 18.04 with some stripped down Tomcat 8 configuration and actually managed to get it up and running without the autowire errors that I had previously experienced
          • I realized that it always works on my local machine with Tomcat 8.5.x, but not when I do the deployment from Ansible in Ubuntu 18.04
          • So there must be something in my Tomcat 8 server.xml template
          • -
          • Now I re-deployed it with the normal server template and it's working, WTF?
          • +
          • Now I re-deployed it with the normal server template and it’s working, WTF?
          • Must have been something like an old DSpace 5.5 file in the spring folder… weird
          • But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc…
          @@ -357,7 +357,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-

          2018-09-16

          • Add the DSpace build.properties as a template into my Ansible infrastructure scripts for configuring DSpace machines
          • -
          • One stupid thing there is that I add all the variables in a private vars file, which is apparently higher precedence than host vars, meaning that I can't override them (like SMTP server) on a per-host basis
          • +
          • One stupid thing there is that I add all the variables in a private vars file, which is apparently higher precedence than host vars, meaning that I can’t override them (like SMTP server) on a per-host basis
          • Discuss access and usage rights with Peter
          • I suggested that we leave access rights (cg.identifier.access) as it is now, with “Open Access” or “Limited Access”, and then simply re-brand that as “Access rights” in the UIs and relevant drop downs
          • Then we continue as planned to add dc.rights as “Usage rights”
          • @@ -374,26 +374,26 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
          • Update these immediately, but talk to CodeObia to create a mapping between the old and new values
          • Finalize dc.rights “Usage rights” with seven combinations of Creative Commons, plus the others
          • -
• Need to double check the new CRP community to see why the collection counts aren't updated after we moved the communities there last week
• +
          • Need to double check the new CRP community to see why the collection counts aren’t updated after we moved the communities there last week
            • I forced a full Discovery re-index and now the community shows 1,600 items
          • -
          • Check if it's possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL
          • -
• Agree that we'll publicize AReS explorer on the week before the Big Data Platform workshop
• +
          • Check if it’s possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL
          • +
          • Agree that we’ll publicize AReS explorer on the week before the Big Data Platform workshop
            • Put a link and or picture on the CGSpace homepage saying “Visualized CGSpace research” or something, and post a message on Yammer
          • I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer
          • -
          • Currently CodeObia is exploring using the Atmire statlets internal API, but I don't really like that…
          • +
          • Currently CodeObia is exploring using the Atmire statlets internal API, but I don’t really like that…
          • There are some example queries on the DSpace Solr wiki
          • For example, this query returns 1655 rows for item 10568/10630:
          $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
           
            -
          • The id in the Solr query is the item's database id (get it from the REST API or something)
          • -
          • Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire's statlet shows, though the query logic here is confusing:
          • +
          • The id in the Solr query is the item’s database id (get it from the REST API or something)
          • +
          • Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire’s statlet shows, though the query logic here is confusing:
          $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
           
            @@ -404,7 +404,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
          • -(bundleName:[*+TO+*]-bundleName:ORIGINAL) seems to be a negative query starting with all documents, subtracting those with bundleName:ORIGINAL, and then negating the whole thing… meaning only documents from bundleName:ORIGINAL?
          -
        • What the shit, I think I'm right: the simplified logic in this query returns the same 889:
        • +
        • What the shit, I think I’m right: the simplified logic in this query returns the same 889:
        $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
         
          @@ -412,12 +412,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
        $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
         
          -
        • As for item views, I suppose that's just the same query, minus the bundleName:ORIGINAL:
        • +
        • As for item views, I suppose that’s just the same query, minus the bundleName:ORIGINAL:
        $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
         
        • That one returns 766, which is exactly 1655 minus 889…
        • -
        • Also, Solr's fq is similar to the regular q query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries
        • +
        • Also, Solr’s fq is similar to the regular q query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries
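• As a small illustration of that, a query like the ones above can either put everything in q, or split the constant restrictions into fq parameters that Solr caches and reuses across queries; both forms return the same count (same local proxy URL and item as above):
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+AND+owningItem:11576+AND+isBot:false+AND+statistics_type:view'
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=statistics_type:view'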

        2018-09-18

@@ -432,7 +432,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
  "views": 15
}
            -
          • The numbers are different than those that come from Atmire's statlets for some reason, but as I'm querying Solr directly, I have no idea where their numbers come from!
          • +
          • The numbers are different than those that come from Atmire’s statlets for some reason, but as I’m querying Solr directly, I have no idea where their numbers come from!
          • Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&page=1
          • Getting all the item IDs from PostgreSQL is certainly easy:
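• For example, a query along these lines against the DSpace 5.x item table would do it (a sketch: only archived, non-withdrawn items):
dspace=# SELECT item_id FROM item WHERE in_archive IS TRUE AND withdrawn IS FALSE; -- sketch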
          @@ -443,7 +443,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-

          2018-09-19

          2018-09-20

            @@ -464,21 +464,21 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-

            2018-09-21

            • I see that there was a nice optimization to the ImageMagick PDF CMYK detection in the upstream dspace-5_x branch: DS-3664
            • -
• The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I'll cherry-pick that fix into our 5_x-prod branch:
• +
            • The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I’ll cherry-pick that fix into our 5_x-prod branch:
              • 4e8c7b578bdbe26ead07e36055de6896bbf02f83: ImageMagick: Only execute “identify” on first page
            • I think it would also be nice to cherry-pick the fixes for DS-3883, which is related to optimizing the XMLUI item display of items with many bitstreams
                -
              • a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don't loop through original bitstreams if only displaying thumbnails
              • +
              • a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don’t loop through original bitstreams if only displaying thumbnails
              • 8d81e825dee62c2aa9d403a505e4a4d798964e8d: DS-3883: If only including thumbnails, only load the main item thumbnail.

2018-09-23

              -
            • I did more work on my cgspace-statistics-api, fixing some item view counts and adding indexing via SQLite (I'm trying to avoid having to set up yet another database, user, password, etc) during deployment
            • +
            • I did more work on my cgspace-statistics-api, fixing some item view counts and adding indexing via SQLite (I’m trying to avoid having to set up yet another database, user, password, etc) during deployment
            • I created a new branch called 5_x-upstream-cherry-picks to test and track those cherry-picks from the upstream 5.x branch
            • Also, I need to test the new LDAP server, so I will deploy that on DSpace Test today
            • Rename my cgspace-statistics-api to dspace-statistics-api on GitHub
            • @@ -486,7 +486,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-

              2018-09-24

              • Trying to figure out how to get item views and downloads from SQLite in a join
              • -
              • It appears SQLite doesn't support FULL OUTER JOIN so some people on StackOverflow have emulated it with LEFT JOIN and UNION:
              • +
              • It appears SQLite doesn’t support FULL OUTER JOIN so some people on StackOverflow have emulated it with LEFT JOIN and UNION:
              > SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
               LEFT JOIN itemdownloads downloads USING(id)
              @@ -495,7 +495,7 @@ SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloa
               LEFT JOIN itemviews views USING(id)
               WHERE views.id IS NULL;
               
                -
              • This “works” but the resulting rows are kinda messy so I'd have to do extra logic in Python
              • +
              • This “works” but the resulting rows are kinda messy so I’d have to do extra logic in Python
• Maybe we can use one “items” table with default values and UPSERT (aka insert… on conflict … do update):
              sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
              @@ -507,9 +507,9 @@ sqlite> INSERT INTO items(id, views) VALUES(0, 3) ON CONFLICT(id) DO UPDATE S
               sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET downloads=excluded.views;
               
              • This totally works!
              • -
              • Note the special excluded.views form! See SQLite's lang_UPSERT documentation
              • -
              • Oh nice, I finally finished the Falcon API route to page through all the results using SQLite's amazing LIMIT and OFFSET support
              • -
              • But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu's SQLite is old and doesn't support UPSERT, so my indexing doesn't work…
              • +
              • Note the special excluded.views form! See SQLite’s lang_UPSERT documentation
              • +
              • Oh nice, I finally finished the Falcon API route to page through all the results using SQLite’s amazing LIMIT and OFFSET support
              • +
              • But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu’s SQLite is old and doesn’t support UPSERT, so my indexing doesn’t work…
              • Apparently UPSERT came in SQLite 3.24.0 (2018-06-04), and Ubuntu 16.04 has 3.11.0
• Ok this is hilarious, I manually downloaded the libsqlite3 3.24.0 deb from Ubuntu 18.10 “cosmic” and installed it in Ubuntu 16.04 and now the Python indexer.py works
• This is definitely a dirty hack, but the list of packages we use that depend on libsqlite3-0 in Ubuntu 16.04 is actually pretty short:
              • @@ -543,28 +543,28 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
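• For reference, the SQLite LIMIT and OFFSET paging mentioned above is as simple as (assuming the items table created earlier and a page size of 100):
sqlite> SELECT id, views, downloads FROM items ORDER BY id LIMIT 100 OFFSET 100;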

                2018-09-25

                • I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views
                • -
                • I'm not even sure how that's possible, as we only have 74,000 items!
                • +
                • I’m not even sure how that’s possible, as we only have 74,000 items!
                • I need to inspect the id values that are returned for views and cross check them with the owningItem values for bitstream downloads…
                • -
                • Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr id field doesn't correspond with actual DSpace items?)
                • -
                • I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don't give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
                • -
                • CGSpace's Solr core has 150,000,000 documents in it… and it's still pretty fast to query, but it's really a maintenance and backup burden
                • -
                • DSpace Test currently has about 2,000,000 documents with isBot:true in its Solr statistics core, and the size on disk is 2GB (it's not much, but I have to test this somewhere!)
                • -
                • According to the DSpace 5.x Solr documentation I can use dspace stats-util -f, so let's try it:
                • +
                • Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr id field doesn’t correspond with actual DSpace items?)
                • +
                • I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don’t give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
                • +
                • CGSpace’s Solr core has 150,000,000 documents in it… and it’s still pretty fast to query, but it’s really a maintenance and backup burden
                • +
                • DSpace Test currently has about 2,000,000 documents with isBot:true in its Solr statistics core, and the size on disk is 2GB (it’s not much, but I have to test this somewhere!)
                • +
                • According to the DSpace 5.x Solr documentation I can use dspace stats-util -f, so let’s try it:
                $ dspace stats-util -f
                 
                • The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with isBot:true
                • -
                • I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it's 201 instead of 2,000,000, and statistics core is only 30MB now!
                • +
                • I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it’s 201 instead of 2,000,000, and statistics core is only 30MB now!
                • I will set the logBots = false property in dspace/config/modules/usage-statistics.cfg on DSpace Test and check if the number of isBot:true events goes up any more…
                • I restarted the server with logBots = false and after it came back up I see 266 events with isBots:true (maybe they were buffered)… I will check again tomorrow
                • -
                • After a few hours I see there are still only 266 view events with isBot:true on DSpace Test's Solr statistics core, so I'm definitely going to deploy this on CGSpace soon
                • -
• Also, CGSpace currently has 60,089,394 view events with isBot:true in its Solr statistics core and it is 124GB!
                • +
                • After a few hours I see there are still only 266 view events with isBot:true on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon
                • +
• Also, CGSpace currently has 60,089,394 view events with isBot:true in its Solr statistics core and it is 124GB!
                • Amazing! After running dspace stats-util -f on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with isBot:true so I should really disable logging of bot events!
                • -
                • I'm super curious to see how the JVM heap usage changes…
                • +
                • I’m super curious to see how the JVM heap usage changes…
                • I made (and merged) a pull request to disable bot logging on the 5_x-prod branch (#387)
                • -
                • Now I'm wondering if there are other bot requests that aren't classified as bots because the IP lists or user agents are outdated
                • +
                • Now I’m wondering if there are other bot requests that aren’t classified as bots because the IP lists or user agents are outdated
                • DSpace ships a list of spider IPs, for example: config/spiders/iplists.com-google.txt
                • -
                • I checked the list against all the IPs we've seen using the “Googlebot” useragent on CGSpace's nginx access logs
                • +
                • I checked the list against all the IPs we’ve seen using the “Googlebot” useragent on CGSpace’s nginx access logs
                • The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be “Googlebot”…
                • According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either googlebot.com or google.com
                • In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):
                • @@ -577,7 +577,7 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
                  • And magically all those 81,000 documents are gone!
                  • After a few hours the Solr statistics core is down to 44GB on CGSpace!
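• A quick way to do the “Googlebot” check above is to pull the unique client IPs out of the nginx logs and reverse-resolve them; genuine ones should have a pointer record ending in googlebot.com or google.com (a sketch):
# zcat --force /var/log/nginx/*.log* | grep 'Googlebot' | awk '{print $1}' | sort -u > /tmp/googlebot-ips.txt
$ host 66.249.66.1 # example IP from Google's published crawler range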
                  • -
                  • I did a major refactor and logic fix in the DSpace Statistics API's indexer.py
                  • +
                  • I did a major refactor and logic fix in the DSpace Statistics API’s indexer.py
                  • Basically, it turns out that using facet.mincount=1 is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways
                  • I deployed the new version on CGSpace and now it looks pretty good!
@@ -585,14 +585,14 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
...
Indexing item downloads (page 260 of 260)
                    -
                  • And now it's fast as hell due to the muuuuch smaller Solr statistics core
                  • +
                  • And now it’s fast as hell due to the muuuuch smaller Solr statistics core

                  2018-09-26

                  • Linode emailed to say that CGSpace (linode18) was using 30Mb/sec of outward bandwidth for two hours around midnight
                  • -
                  • I don't see anything unusual in the nginx logs, so perhaps it was the cron job that syncs the Solr database to Amazon S3?
                  • +
                  • I don’t see anything unusual in the nginx logs, so perhaps it was the cron job that syncs the Solr database to Amazon S3?
                  • It could be that the bot purge yesterday changed the core significantly so there was a lot to change?
                  • -
                  • I don't see any drop in JVM heap size in CGSpace's munin stats since I did the Solr cleanup, but this looks pretty good:
                  • +
                  • I don’t see any drop in JVM heap size in CGSpace’s munin stats since I did the Solr cleanup, but this looks pretty good:

                  Tomcat max processing time week

@@ -610,16 +610,16 @@ real 77m3.755s
user 7m39.785s
sys 2m18.485s
                      -
                    • I told Peter it's better to do the access rights before the usage rights because the git branches are conflicting with each other and it's actually a pain in the ass to keep changing the values as we discuss, rebase, merge, fix conflicts…
                    • +
                    • I told Peter it’s better to do the access rights before the usage rights because the git branches are conflicting with each other and it’s actually a pain in the ass to keep changing the values as we discuss, rebase, merge, fix conflicts…
                    • Udana and Mia from WLE were asking some questions about their WLE Feedburner feed
                    • -
                    • It's pretty confusing, because until recently they were entering issue dates as only YYYY (like 2018) and their feeds were all showing items in the wrong order
                    • -
                    • I'm not exactly sure what their problem now is, though (confusing)
                    • -
• I updated the dspace-statistics-api to use psycopg2's execute_values() to insert batches of 100 values into PostgreSQL instead of doing every insert individually
                    • +
                    • It’s pretty confusing, because until recently they were entering issue dates as only YYYY (like 2018) and their feeds were all showing items in the wrong order
                    • +
                    • I’m not exactly sure what their problem now is, though (confusing)
                    • +
• I updated the dspace-statistics-api to use psycopg2’s execute_values() to insert batches of 100 values into PostgreSQL instead of doing every insert individually
                    • On CGSpace this reduces the total run time of indexer.py from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)

                    2018-09-27

                      -
                    • Linode emailed to say that CGSpace's (linode19) CPU load was high for a few hours last night
                    • +
                    • Linode emailed to say that CGSpace’s (linode19) CPU load was high for a few hours last night
                    • Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:
                    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                    @@ -643,7 +643,7 @@ sys     2m18.485s
                     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
                     758
                     
                      -
                    • I will add their IPs to the list of bad bots in nginx so we can add a “bot” user agent to them and let Tomcat's Crawler Session Manager Valve handle them
                    • +
                    • I will add their IPs to the list of bad bots in nginx so we can add a “bot” user agent to them and let Tomcat’s Crawler Session Manager Valve handle them
                    • I asked Atmire to prepare an invoice for 125 credits

                    2018-09-29

                    @@ -670,7 +670,7 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
                  dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
                   
                    -
                  • Then I can simply delete the “Other” and “other” ones because that's not useful at all:
                  • +
                  • Then I can simply delete the “Other” and “other” ones because that’s not useful at all:
                  dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
                   DELETE 6
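• And presumably the same for the lowercase variant:
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';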
diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html
                  index 194f7707d..cd29ea479 100644
                  --- a/docs/2018-10/index.html
                  +++ b/docs/2018-10/index.html
@@ -9,7 +9,7 @@
@@ -21,9 +21,9 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
@@ -53,7 +53,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
@@ -100,7 +100,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
                     

                  October, 2018

                  @@ -108,7 +108,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo

                  2018-10-01

                  • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
                  • -
                  • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
                  • +
                  • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now

                  2018-10-03

@@ -133,7 +133,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
 118927 200
  31435 500
                    -
                  • I added Phil Thornton and Sonal Henson's ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:
                  • +
                  • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:
                  $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
                   $ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
                  @@ -160,14 +160,14 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
                     87646 34.218.226.147
                    111729 213.139.53.62
                   
                    -
                  • But in super positive news, he says they are using my new dspace-statistics-api and it's MUCH faster than using Atmire CUA's internal “restlet” API
                  • -
                  • I don't recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
                  • +
                  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API
                  • +
                  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
                  # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
                      8324 GET /bitstream
                      4193 GET /handle
                   
                    -
                  • Suspiciously, it's only grabbing the CGIAR System Office community (handle prefix 10947):
                  • +
                  • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):
                  # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
                         7 GET /handle/10568
                  @@ -177,9 +177,9 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
                   
                Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
                 
                  -
                • It's clearly a bot and it's not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
                • -
                • I looked in Solr's statistics core and these hits were actually all counted as isBot:false (of course)… hmmm
                • -
                • I tagged all of Sonal and Phil's items with their ORCID identifiers on CGSpace using my add-orcid-identifiers.py script:
                • +
                • It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
                • +
                • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm
                • +
                • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers.py script:
                $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
                 
                  @@ -205,7 +205,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
                • I see there are other bundles we might need to pay attention to: TEXT, @_LOGO-COLLECTION_@, @_LOGO-COMMUNITY_@, etc…
                • On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads
                • -
                • So it's fixed, but I'm not sure why!
                • +
                • So it’s fixed, but I’m not sure why!
• Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
                # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
                @@ -216,7 +216,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
                 

              2018-10-05

                -
              • Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay's work plan
              • +
              • Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay’s work plan
              • We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms
              • Add a link to AReS explorer to the CGSpace homepage introduction text
              @@ -224,30 +224,30 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
              • Follow up with AgriKnowledge about including Handle links (dc.identifier.uri) on their item pages
              • In July, 2018 they had said their programmers would include the field in the next update of their website software
              • -
              • CIMMYT's DSpace repository is now running DSpace 5.x!
              • -
              • It's running OAI, but not REST, so I need to talk to Richard about that!
              • +
              • CIMMYT’s DSpace repository is now running DSpace 5.x!
              • +
              • It’s running OAI, but not REST, so I need to talk to Richard about that!

              2018-10-08

                -
              • AgriKnowledge says they're going to add the dc.identifier.uri to their item view in November when they update their website software
              • +
              • AgriKnowledge says they’re going to add the dc.identifier.uri to their item view in November when they update their website software

              2018-10-10

                -
              • Peter noticed that some recently added PDFs don't have thumbnails
              • -
              • When I tried to force them to be generated I got an error that I've never seen before:
              • +
              • Peter noticed that some recently added PDFs don’t have thumbnails
              • +
              • When I tried to force them to be generated I got an error that I’ve never seen before:
              $ dspace filter-media -v -f -i 10568/97613
               org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
               
                -
              • I see there was an update to Ubuntu's ImageMagick on 2018-10-05, so maybe something changed or broke?
              • -
• I get the same error when forcing filter-media to run on DSpace Test too, so it's gotta be an ImageMagick bug
              • +
              • I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?
              • +
• I get the same error when forcing filter-media to run on DSpace Test too, so it’s gotta be an ImageMagick bug
              • The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an Ubuntu Security Notice from 2018-10-04
• Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)
              • I commented out the line that disables PDF thumbnails in /etc/ImageMagick-6/policy.xml:
                <!--<policy domain="coder" rights="none" pattern="PDF" />-->
               
                -
              • This works, but I'm not sure what ImageMagick's long-term plan is if they are going to disable ALL image formats…
              • +
              • This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…
              • I suppose I need to enable a workaround for this in Ansible?
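• One way to script that workaround until the Ansible templates catch up (a sketch; it just comments the PDF policy back out in place):
$ sudo sed -i 's|<policy domain="coder" rights="none" pattern="PDF" />|<!--&-->|' /etc/ImageMagick-6/policy.xml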

              2018-10-11

              @@ -292,7 +292,7 @@ COPY 10000

              2018-10-13

              • Run all system updates on DSpace Test (linode19) and reboot it
              • -
              • Look through Peter's list of 746 author corrections in OpenRefine
              • +
              • Look through Peter’s list of 746 author corrections in OpenRefine
              • I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:
              or(
              @@ -307,13 +307,13 @@ COPY 10000
               
            $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
             
              -
            • I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay's author controlled vocabulary
            • +
            • I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary

            2018-10-14

            • Merge the authors controlled vocabulary (#393), usage rights (#394), and the upstream DSpace 5.x cherry-picks (#394) into our 5_x-prod branch
            • -
• Switch to new CGIAR LDAP server on CGSpace, as it's been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
            • -
            • Apply Peter's 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
            • +
• Switch to new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
            • +
            • Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
            $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
             
              @@ -322,21 +322,21 @@ COPY 10000
            • Restarting the service with systemd works for a few seconds, then the java process quits
            • I suspect that the systemd service type needs to be forking rather than simple, because the service calls the default DSpace start-handle-server shell script, which uses nohup and & to background the java process
            • It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting
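• A minimal sketch of what a forking unit for it could look like (the paths and user here are illustrative, not the real deployment):
# cat /etc/systemd/system/handle-server.service
[Unit]
Description=DSpace Handle Server

[Service]
Type=forking
User=dspace
# start-handle-server backgrounds the JVM with nohup and exits, hence Type=forking
ExecStart=/home/dspace/bin/start-handle-server

[Install]
WantedBy=multi-user.target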
            • -
            • Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page's header, rather than the HTML properties they are using in their body
            • +
            • Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page’s header, rather than the HTML properties they are using in their body
            • Peter pointed out that some thumbnails were still not getting generated
              • When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week… WTF?
              • Looks like I can use /usr/share/ghostscript/current instead of /usr/share/ghostscript/9.25
            • -
            • I limited the tall thumbnails even further to 170px because Peter said CTA's were still too tall at 200px (#396)
            • +
            • I limited the tall thumbnails even further to 170px because Peter said CTA’s were still too tall at 200px (#396)

            2018-10-15

            • Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications
            • -
            • I don't see anything in the Catalina logs or dmesg, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”
            • +
            • I don’t see anything in the Catalina logs or dmesg, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”
            • Actually, now I remember that yesterday when I deployed the latest changes from git on DSpace Test I noticed a syntax error in one XML file when I was doing the discovery reindexing
            • -
            • I fixed it so that I could reindex, but I guess the rest of DSpace actually didn't start up…
            • +
            • I fixed it so that I could reindex, but I guess the rest of DSpace actually didn’t start up…
            • Create an account on DSpace Test for Felix from Earlham so he can test COPO submission
              • I created a new collection and added him as the administrator so he can test submission
              • @@ -360,8 +360,8 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
              dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
               
                -
              • Talking to the CodeObia guys about the REST API I started to wonder why it's so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
              • -
              • Interestingly, the speed doesn't get better after you request the same thing multiple times–it's consistently bad on both CGSpace and DSpace Test!
              • +
              • Talking to the CodeObia guys about the REST API I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
              • +
              • Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!
              $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
               ...
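              • A crude way to put numbers on it is to repeat the request a few times and record just the total time (curl used here for its timing output; same URL as above):
              $ for i in $(seq 1 5); do curl -s -o /dev/null -w '%{time_total}\n' 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'; done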
              @@ -441,13 +441,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
               
              Looking up the names associated with ORCID iD: 0000-0001-7930-5752
               Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
               
                -
              • So I need to handle that situation in the script for sure, but I'm not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?
              • +
              • So I need to handle that situation in the script for sure, but I’m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?
              • I made a pull request and merged the ORCID updates into the 5_x-prod branch (#397)
              • Improve the logic of name checking in my resolve-orcids.py script

              2018-10-18

                -
              • I granted MEL's deposit user admin access to IITA, CIP, Bioversity, and RTB communities on DSpace Test so they can start testing real depositing
              • +
              • I granted MEL’s deposit user admin access to IITA, CIP, Bioversity, and RTB communities on DSpace Test so they can start testing real depositing
              • After they do some tests and we check the values Enrico will send a formal email to Peter et al to ask that they start depositing officially
              • I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually migrate from 9.5 to 9.6:
              @@ -474,12 +474,12 @@ $ exit 1629 66.249.64.91 1758 5.9.6.51
                -
              • 5.9.6.51 is MegaIndex, which I've seen before…
              • +
              • 5.9.6.51 is MegaIndex, which I’ve seen before…

              2018-10-20

                -
              • I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace's Solr configuration is for 4.9
              • -
              • This means our existing Solr configuration doesn't run in Solr 5.5:
              • +
              • I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace’s Solr configuration is for 4.9
              • +
              • This means our existing Solr configuration doesn’t run in Solr 5.5:
              $ sudo docker pull solr:5
               $ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
              @@ -488,7 +488,7 @@ $ sudo docker logs my_solr
               ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
               
              • Apparently a bunch of variable types were removed in Solr 5
              • -
              • So for now it's actually a huge pain in the ass to run the tests for my dspace-statistics-api
              • +
              • So for now it’s actually a huge pain in the ass to run the tests for my dspace-statistics-api
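              • A quick, hedged way to see which of those removed legacy field types our Solr 4.9 statistics schema still references (path taken from the volume mount above):
              $ grep -E 'solr\.(Int|Long|Float|Double|Date)Field' ~/dspace/solr/statistics/conf/schema.xml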
              • Linode sent a message that the CPU usage was high on CGSpace (linode18) last night
              • According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:
              @@ -517,11 +517,11 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq 8915
                -
              • Last month I added “crawl” to the Tomcat Crawler Session Manager Valve's regular expression matching, and it seems to be working for MegaIndex's user agent:
              • +
              • Last month I added “crawl” to the Tomcat Crawler Session Manager Valve’s regular expression matching, and it seems to be working for MegaIndex’s user agent:
              $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
               
                -
              • So I'm not sure why this bot uses so many sessions — is it because it requests very slowly?
              • +
              • So I’m not sure why this bot uses so many sessions — is it because it requests very slowly?
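              • For reference, the Valve in question lives inside the <Host> element of Tomcat’s server.xml and looks roughly like this (the regex here is illustrative, not the exact deployed one):
              <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
                     crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*crawl.*" />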

              2018-10-21

                @@ -552,7 +552,7 @@ UPDATE 76608
              • Improve the usage rights (dc.rights) on CGSpace again by adding the long names in the submission form, as well as adding version 3.0 and the Creative Commons Zero (CC0) public domain license (#399)
              • Add “usage rights” to the XMLUI item display (#400)
              • I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace
              • -
              • Testing REST login and logout via httpie because Felix from Earlham says he's having issues:
              • +
              • Testing REST login and logout via httpie because Felix from Earlham says he’s having issues:
              $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
               acef8a4a-41f3-4392-b870-e873790f696b
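              • The follow-up calls pass that token in the rest-dspace-token header (token copied from the login response above), something like:
              $ http 'https://dspacetest.cgiar.org/rest/status' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
              $ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b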
              @@ -576,8 +576,8 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
               
              • I deployed the new Creative Commons choices to the usage rights on the CGSpace submission form
              • Also, I deployed the changes to show usage rights on the item view
              • -
              • Re-work the dspace-statistics-api to use Python's native json instead of ujson to make it easier to deploy in places where we don't have — or don't want to have — Python headers and a compiler (like containers)
              • -
              • Re-work the deployment of the API to use systemd's EnvironmentFile to read the environment variables instead of Environment in the RMG Ansible infrastructure scripts
              • +
              • Re-work the dspace-statistics-api to use Python’s native json instead of ujson to make it easier to deploy in places where we don’t have — or don’t want to have — Python headers and a compiler (like containers)
              • +
              • Re-work the deployment of the API to use systemd’s EnvironmentFile to read the environment variables instead of Environment in the RMG Ansible infrastructure scripts
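              • A minimal sketch of that change as a systemd drop-in (say /etc/systemd/system/dspace-statistics-api.service.d/override.conf; the unit name and env file path are assumptions):
              [Service]
              EnvironmentFile=/etc/dspace-statistics-api.env
              $ sudo systemctl daemon-reload && sudo systemctl restart dspace-statistics-api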

              2018-10-25

                @@ -602,7 +602,7 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
              • Then I re-generated the requirements.txt in the dspace-statistics-library and released version 0.5.2
              • Then I re-deployed the API on DSpace Test, ran all system updates on the server, and rebooted it
              • I tested my hack of depositing to one collection where the default item and bitstream READ policies are restricted and then mapping the item to another collection, but the item retains its default policies so Anonymous cannot see them in the mapped collection either
              • -
              • Perhaps we need to try moving the item and inheriting the target collection's policies?
              • +
              • Perhaps we need to try moving the item and inheriting the target collection’s policies?
              • I merged the changes for adding publisher (dc.publisher) to the advanced search to the 5_x-prod branch (#402)
              • I merged the changes for adding versionless Creative Commons licenses to the submission form to the 5_x-prod branch (#403)
              • I will deploy them later this week
              • @@ -617,7 +617,7 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
              • Meet with the COPO guys to walk them through the CGSpace submission workflow and discuss CG core, REST API, etc
                • I suggested that they look into submitting via the SWORDv2 protocol because it respects the workflows
                • -
                • They said that they're not too worried about the hierarchical CG core schema, that they would just flatten metadata like affiliations when depositing to a DSpace repository
                • +
                • They said that they’re not too worried about the hierarchical CG core schema, that they would just flatten metadata like affiliations when depositing to a DSpace repository
                • I said that it might be time to engage the DSpace community to add support for more advanced schemas in DSpace 7+ (perhaps partnership with Atmire?)
              • diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html index db2199bd2..7918a46e4 100644 --- a/docs/2018-11/index.html +++ b/docs/2018-11/index.html @@ -33,7 +33,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage Today these are the top 10 IPs: "/> - + @@ -63,7 +63,7 @@ Today these are the top 10 IPs: - + @@ -110,7 +110,7 @@ Today these are the top 10 IPs:

                November, 2018

                @@ -138,7 +138,7 @@ Today these are the top 10 IPs: 22508 66.249.64.59
              • The 66.249.64.x are definitely Google
              • -
              • 70.32.83.92 is well known, probably CCAFS or something, as it's only a few thousand requests and always to REST API
              • +
              • 70.32.83.92 is well known, probably CCAFS or something, as it’s only a few thousand requests and always to REST API
              • 84.38.130.177 is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:
              Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
              @@ -154,13 +154,13 @@ Today these are the top 10 IPs:
               
            Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
             
              -
            • And it doesn't seem they are re-using their Tomcat sessions:
            • +
            • And it doesn’t seem they are re-using their Tomcat sessions:
            $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
             1243
             
              -
            • Ah, we've apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
            • -
            • I wonder if it's worth adding them to the list of bots in the nginx config?
            • +
            • Ah, we’ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
            • +
            • I wonder if it’s worth adding them to the list of bots in the nginx config?
            • Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth
            • Looking at the nginx logs again I see the following top ten IPs:
            @@ -176,11 +176,11 @@ Today these are the top 10 IPs: 12557 78.46.89.18 32152 66.249.64.59
              -
            • 78.46.89.18 is new since I last checked a few hours ago, and it's from Hetzner with the following user agent:
            • +
            • 78.46.89.18 is new since I last checked a few hours ago, and it’s from Hetzner with the following user agent:
            Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
             
              -
            • It's making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
            • +
            • It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
            $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
             8449
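            • (Note that -c counts matching log lines, not unique sessions; de-duplicating the matches gives the real session count, which is what the 2018-12-04 correction below refers to:)
            $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort -u | wc -l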
            @@ -190,7 +190,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
             
          • Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions
          • I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing
          • Perhaps I should think about adding rate limits to dynamic pages like /discover and /browse (a hedged nginx sketch follows at the end of this section)
          • -
          • I think it's reasonable for a human to click one of those links five or ten times a minute…
          • +
          • I think it’s reasonable for a human to click one of those links five or ten times a minute…
          • To contrast, 78.46.89.18 made about 300 requests per minute for a few hours today:
          # grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
          @@ -221,7 +221,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
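          • Following up on the rate-limit idea above, a hedged nginx sketch (zone name, rate, and burst are guesses; limit_req_zone goes in the http block and the location in the server block):
          limit_req_zone $binary_remote_addr zone=dynamicpages:16m rate=10r/m;

          location ~ ^/(discover|browse) {
              limit_req zone=dynamicpages burst=5 nodelay;
              # existing proxy_pass and headers stay as they are
          }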
           

        2018-11-04

          -
        • Forward Peter's information about CGSpace financials to Modi from ICRISAT
        • +
        • Forward Peter’s information about CGSpace financials to Modi from ICRISAT
        • Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again
        • Here are the top ten IPs active so far this morning:
        @@ -355,7 +355,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11

        2018-11-08

        • I deployed version 0.7.0 of the dspace-statistics-api on DSpace Test (linode19) so I can test it for a few days (and check the Munin stats to see the change in database connections) before deploying on CGSpace
        • -
        • I also enabled systemd's persistent journal by setting Storage=persistent in journald.conf
        • +
        • I also enabled systemd’s persistent journal by setting Storage=persistent in journald.conf
        • Apparently Ubuntu 16.04 defaulted to using rsyslog for boot records until early 2018, so I removed rsyslog too
        • Proof 277 IITA records on DSpace Test: IITA_ ALIZZY1802-csv_oct23
            @@ -371,7 +371,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11

            2018-11-13

            • Help troubleshoot an issue with Judy Kimani submitting to the ILRI project reports, papers and documents collection on CGSpace
            • -
            • For some reason there is an existing group for the “Accept/Reject” workflow step, but it's empty
            • +
            • For some reason there is an existing group for the “Accept/Reject” workflow step, but it’s empty
            • I added Judy to the group and told her to try again
            • Sisay changed his leave to be full days until December so I need to finish the IITA records that he was working on (IITA_ ALIZZY1802-csv_oct23)
            • Sisay had said there were a few PDFs missing and Bosede sent them this week, so I had to find those items on DSpace Test and add the bitstreams to the items manually
            • @@ -381,7 +381,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11

              2018-11-14

              • Finally import the 277 IITA (ALIZZY1802) records to CGSpace
              • -
              • I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace's dspace export command doesn't include the collections for the items!
              • +
              • I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace’s dspace export command doesn’t include the collections for the items!
              • Delete all old IITA collections on DSpace Test and run dspace cleanup to get rid of all the bitstreams

              2018-11-15

              @@ -428,12 +428,12 @@ java.lang.IllegalStateException: DSpace kernel cannot be null at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78) 2018-11-19 15:23:04,223 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
                -
              • I looked in the Solr log around that time and I don't see anything…
              • -
              • Working on Udana's WLE records from last month, first the sixteen records in 2018-11-20 RDL Temp
              • +
              • I looked in the Solr log around that time and I don’t see anything…
              • +
              • Working on Udana’s WLE records from last month, first the sixteen records in 2018-11-20 RDL Temp
                • these items will go to the Restoring Degraded Landscapes collection
                • a few items missing DOIs, but they are easily available on the publication page
                • -
                • clean up DOIs to use “https://doi.org" format
                • +
                • clean up DOIs to use “https://doi.org” format
                • clean up some cg.identifier.url to remove unnecessary query strings
                • remove columns with no metadata (river basin, place, target audience, isbn, uri, publisher, ispartofseries, subject)
                • fix column with invalid spaces in metadata field name (cg. subject. wle)
                • @@ -447,12 +447,12 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
                • these items will go to the Variability, Risks and Competing Uses collection
                • trim and collapse whitespace in all fields (lots in WLE subject!)
                • clean up some cg.identifier.url fields that had unnecessary anchors in their links
                • -
                • clean up DOIs to use “https://doi.org" format
                • +
                • clean up DOIs to use “https://doi.org” format
                • fix column with invalid spaces in metadata field name (cg. subject. wle)
                • remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)
                • remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: value.replace('�','')
                • -
                • I notice a few items using DOIs pointing at ICARDA's DSpace like: https://doi.org/20.500.11766/8178, which then points at the “real” DOI on the publisher's site… these should be using the real DOI instead of ICARDA's “fake” Handle DOI
                • -
                • Some items missing DOIs, but they clearly have them if you look at the publisher's site
                • +
                • I notice a few items using DOIs pointing at ICARDA’s DSpace like: https://doi.org/20.500.11766/8178, which then points at the “real” DOI on the publisher’s site… these should be using the real DOI instead of ICARDA’s “fake” Handle DOI
                • +
                • Some items missing DOIs, but they clearly have them if you look at the publisher’s site
              @@ -463,7 +463,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
            • Judy Kimani was having issues resuming submissions in another ILRI collection recently, and the issue there was due to an empty group defined for the “accept/reject” step (aka workflow step 1)
            • The error then was “authorization denied for workflow step 1” where “workflow step 1” was the “accept/reject” step, which had a group defined, but was empty
            • Adding her to this group solved her issues
            • -
            • Tezira says she's also getting the same “authorization denied” error for workflow step 1 when resuming submissions, so I told Abenet to delete the empty group
            • +
            • Tezira says she’s also getting the same “authorization denied” error for workflow step 1 when resuming submissions, so I told Abenet to delete the empty group
          @@ -475,7 +475,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
          $ dspace index-discovery -r 10568/41888
           $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
           
            -
          • … but the item still doesn't appear in the collection
          • +
          • … but the item still doesn’t appear in the collection
          • Now I will try a full Discovery re-index:
          $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
          @@ -503,7 +503,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
              4564 70.32.83.92
           
          • We know 70.32.83.92 is the CCAFS harvester on MediaTemple, but 205.186.128.185 is new and appears to be another CCAFS harvester
          • -
          • I think we might want to prune some old accounts from CGSpace, perhaps users who haven't logged in in the last two years would be a conservative bunch:
          • +
          • I think we might want to prune some old accounts from CGSpace, perhaps users who haven’t logged in in the last two years would be a conservative bunch:
          $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
           409
          @@ -514,15 +514,15 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
           
          • The workflow step 1 (accept/reject) is now undefined for some reason
          • Last week the group was defined, but empty, so we added her to the group and she was able to take the tasks
          • -
          • Since then it looks like the group was deleted, so now she didn't have permission to take or leave the tasks in her pool
          • -
          • We added her back to the group, then she was able to take the tasks, and then we removed the group again, as we generally don't use this step in CGSpace
          • +
          • Since then it looks like the group was deleted, so now she didn’t have permission to take or leave the tasks in her pool
          • +
          • We added her back to the group, then she was able to take the tasks, and then we removed the group again, as we generally don’t use this step in CGSpace
        • Help Marianne troubleshoot some issue with items in their WLE collections and the WLE publications website

        2018-11-28

          -
        • Change the usage rights text a bit based on Maria Garruccio's feedback on “all rights reserved” (#404)
        • +
        • Change the usage rights text a bit based on Maria Garruccio’s feedback on “all rights reserved” (#404)
        • Run all system updates on DSpace Test (linode19) and reboot the server
        diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html index 032c9d61b..77d4e8f38 100644 --- a/docs/2018-12/index.html +++ b/docs/2018-12/index.html @@ -33,7 +33,7 @@ Then I ran all system updates and restarted the server I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week "/> - + @@ -63,7 +63,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see - + @@ -110,7 +110,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see

        December, 2018

        @@ -148,7 +148,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
        • A comment on the StackOverflow question from yesterday suggests it might be a bug with the pngalpha device in Ghostscript and links to an upstream bug
        • I think we need to wait for a fix from Ubuntu
        • -
        • For what it's worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
        • +
        • For what it’s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
        $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
         DEBUG: FC_WEIGHT didn't match
        @@ -167,7 +167,7 @@ DEBUG: FC_WEIGHT didn't match
         
      • One item had “MADAGASCAR” for ISI Journal
      • Minor corrections in IITA subject (LIVELIHOOD→LIVELIHOODS)
      • Trim whitespace in abstract field
      • -
      • Fix some sponsors (though some with “Governments of Canada” etc I'm not sure why those are plural)
      • +
      • Fix some sponsors (though some with “Governments of Canada” etc I’m not sure why those are plural)
      • Eighteen items had en||fr for the language, but the content was only in French so changed them to just fr
      • Six items had encoding errors in French text so I will ask Bosede to re-do them carefully
      • Correct and normalize a few AGROVOC subjects
      • @@ -198,18 +198,18 @@ DEBUG: FC_WEIGHT didn't match Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000 identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
          -
        • And wow, I can't even run ImageMagick's identify on the first page of the second item (10568/98930):
        • +
        • And wow, I can’t even run ImageMagick’s identify on the first page of the second item (10568/98930):
        $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
         zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
         
          -
        • But with GraphicsMagick's identify it works:
        • +
        • But with GraphicsMagick’s identify it works:
        $ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
         DEBUG: FC_WEIGHT didn't match
         Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
         
        $ identify Food\ safety\ Kenya\ fruits.pdf
         Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
        @@ -226,7 +226,7 @@ zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnai
         $ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
         DEBUG: FC_WEIGHT didn't match
         
          -
        • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B and the other one doesn't list a profile, though I don't think this is relevant
        • +
        • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B and the other one doesn’t list a profile, though I don’t think this is relevant
        • I found another item that fails when generating a thumbnail (10568/98391); DSpace complains:
        org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
        @@ -256,16 +256,16 @@ Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `
                 at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
                 ... 15 more
         
          -
        • And on my Arch Linux environment ImageMagick's convert also segfaults:
        • +
        • And on my Arch Linux environment ImageMagick’s convert also segfaults:
        $ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
         zsh: abort (core dumped)  convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\]  x60
         
          -
        • But GraphicsMagick's convert works:
        • +
        • But GraphicsMagick’s convert works:
        $ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
         
          -
        • So far the only thing that stands out is that the two files that don't work were created with Microsoft Office 2016:
        • +
        • So far the only thing that stands out is that the two files that don’t work were created with Microsoft Office 2016:
        $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
         Creator:        Microsoft® Word 2016
        @@ -285,14 +285,14 @@ Producer:       Microsoft® Word for Office 365
         
        $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
         $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
         
          -
        • I've tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
        • +
        • I’ve tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
        • In the end I tried one last time to just apply without registering and it was apparently successful
        • Testing DSpace 5.8 (5_x-prod branch) in an Ubuntu 18.04 VM with Tomcat 8.5 and had some issues:
          • JSPUI shows an internal error (log shows something about tag cloud, though, so might be unrelated)
          • -
          • Atmire Listings and Reports, which use JSPUI, asks you to log in again and then doesn't work
          • -
          • Content and Usage Analysis doesn't show up in the sidebar after logging in
          • -
          • I can navigate to /atmire/reporting-suite/usage-graph-editor, but it's only the Atmire theme and a “page not found” message
          • +
          • Atmire Listings and Reports, which use JSPUI, asks you to log in again and then doesn’t work
          • +
          • Content and Usage Analysis doesn’t show up in the sidebar after logging in
          • +
          • I can navigate to /atmire/reporting-suite/usage-graph-editor, but it’s only the Atmire theme and a “page not found” message
          • Related messages from dspace.log:
        • @@ -311,7 +311,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg

        2018-12-04

          -
        • Last night Linode sent a message that the load on CGSpace (linode18) was too high, here's a list of the top users at the time and throughout the day:
        • +
        • Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:
        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
             225 40.77.167.142
        @@ -336,14 +336,14 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
            3210 2a01:4f8:140:3192::2
            4190 35.237.175.180
         
          -
        • 35.237.175.180 is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working:
        • +
        • 35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:
        $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
         4772
         $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
         630
         
          -
        • I haven't seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
        • +
        • I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
        Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
         
          @@ -366,7 +366,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2 $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l 1
          -
        • In other news, it's good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):
        • +
        • In other news, it’s good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):

        PostgreSQL connections day

        2018-12-05

        @@ -376,7 +376,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03

        2018-12-06

        • Linode sent a message that the CPU usage of CGSpace (linode18) is too high last night
        • -
        • I looked in the logs and there's nothing particular going on:
        • +
        • I looked in the logs and there’s nothing particular going on:
        # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
            1225 157.55.39.177
        @@ -402,8 +402,8 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
         1156
         
        • 2a01:7e00::f03c:91ff:fe0a:d645 appears to be the CKM dev server where Danny is testing harvesting via Drupal
        • -
        • It seems they are hitting the XMLUI's OpenSearch a bit, but mostly on the REST API so no issues here yet
        • -
        • Drupal is already in the Tomcat Crawler Session Manager Valve's regex so that's good!
        • +
        • It seems they are hitting the XMLUI’s OpenSearch a bit, but mostly on the REST API so no issues here yet
        • +
        • Drupal is already in the Tomcat Crawler Session Manager Valve’s regex so that’s good!

        2018-12-10

          @@ -414,7 +414,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
        • It sounds kinda crazy, but she said when she talked to Altmetric about their Twitter harvesting they said their coverage is not perfect, so it might be some kinda prioritization thing where they only do it for popular items?
        • I am testing this by tweeting one WLE item from CGSpace that currently has no Altmetric score
        • Interestingly, after about an hour I see it has already been picked up by Altmetric and has my tweet as well as some other tweet from over a month ago…
        • -
        • I tweeted a link to the item's DOI to see if Altmetric will notice it, hopefully associated with the Handle I tweeted earlier
        • +
        • I tweeted a link to the item’s DOI to see if Altmetric will notice it, hopefully associated with the Handle I tweeted earlier
      @@ -429,9 +429,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05

    2018-12-13

      -
    • Oh this is very interesting: WorldFish's repository is live now
    • -
    • It's running DSpace 5.9-SNAPSHOT running on KnowledgeArc and the OAI and REST interfaces are active at least
    • -
    • Also, I notice they ended up registering a Handle (they had been considering taking KnowledgeArc's advice to not use Handles!)
    • +
    • Oh this is very interesting: WorldFish’s repository is live now
    • +
    • It’s running DSpace 5.9-SNAPSHOT running on KnowledgeArc and the OAI and REST interfaces are active at least
    • +
    • Also, I notice they ended up registering a Handle (they had been considering taking KnowledgeArc’s advice to not use Handles!)
    • Did some coordination work on the hotel bookings for the January AReS workshop in Amman

    2018-12-17

    @@ -479,7 +479,7 @@ $ ls -lh cgspace_2018-12-19.backup* -rw-r--r-- 1 aorth aorth 94M Dec 20 11:36 cgspace_2018-12-19.backup.gz -rw-r--r-- 1 aorth aorth 93M Dec 20 11:35 cgspace_2018-12-19.backup.xz
      -
    • Looks like it's really not worth it…
    • +
    • Looks like it’s really not worth it…
    • Peter pointed out that Discovery filters for CTA subjects on item pages were not working
    • It looks like there were some mismatches in the Discovery index names and the XMLUI configuration, so I fixed them (#406)
    • Peter asked if we could create a controlled vocabulary for publishers (dc.publisher)
    • @@ -491,7 +491,7 @@ $ ls -lh cgspace_2018-12-19.backup* 3522 (1 row)
        -
      • I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we're not pushing forward with the new status terms for now
      • +
      • I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we’re not pushing forward with the new status terms for now
      • Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:
      # dpkg -P oracle-java8-installer oracle-java8-set-default
      @@ -514,7 +514,7 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
       # pg_dropcluster 9.5 main
       # dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
       
        -
      • I've been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
      • +
      • I’ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
      • Run all system updates on CGSpace (linode18) and restart the server
      • Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:
      @@ -564,7 +564,7 @@ UPDATE 1 1253 54.70.40.11
      • All these look ok (54.70.40.11 is known to us from earlier this month and should be reusing its Tomcat sessions)
      • -
      • So I'm not sure what was going on last night…
      • +
      • So I’m not sure what was going on last night…
      diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html index 693576725..3f135b7b5 100644 --- a/docs/2019-01/index.html +++ b/docs/2019-01/index.html @@ -9,7 +9,7 @@ - + @@ -77,7 +77,7 @@ I don't see anything interesting in the web server logs around that time tho - + @@ -124,7 +124,7 @@ I don't see anything interesting in the web server logs around that time tho

      January, 2019

      @@ -132,7 +132,7 @@ I don't see anything interesting in the web server logs around that time tho

      2019-01-02

      • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
      • -
      • I don't see anything interesting in the web server logs around that time though:
      • +
      • I don’t see anything interesting in the web server logs around that time though:
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
            92 40.77.167.4
      @@ -158,7 +158,7 @@ I don't see anything interesting in the web server logs around that time tho
       # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
           261 handle
       
        -
      • It's not clear to me what was causing the outbound traffic spike
      • +
      • It’s not clear to me what was causing the outbound traffic spike
      • Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):
      Moving: 81742 into core statistics-2010
      @@ -182,7 +182,7 @@ Moving: 18497180 into core statistics-2018
       $ sudo docker rm dspacedb
       $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
       
        -
      • Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire's Listings and Reports still doesn't work
      • +
      • Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire’s Listings and Reports still doesn’t work
        • After logging in via XMLUI and clicking the Listings and Reports link from the sidebar it redirects me to a JSPUI login page
        • If I log in again there the Listings and Reports work… hmm.
        • @@ -264,17 +264,17 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748)
            -
          • I notice that I get different JSESSIONID cookies for / (XMLUI) and /jspui (JSPUI) on Tomcat 8.5.37, I wonder if it's the same on Tomcat 7.0.92… yes I do.
          • +
          • I notice that I get different JSESSIONID cookies for / (XMLUI) and /jspui (JSPUI) on Tomcat 8.5.37, I wonder if it’s the same on Tomcat 7.0.92… yes I do.
          • Hmm, on Tomcat 7.0.92 I see that I get a dspace.current.user.id session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in…
              -
            • I didn't see that cookie being set on Tomcat 8.5.37
            • +
            • I didn’t see that cookie being set on Tomcat 8.5.37
          • I sent a message to the dspace-tech mailing list to ask

          2019-01-04

            -
          • Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don't see anything around that time in the web server logs:
          • +
          • Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:
          # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
               189 207.46.13.192
          @@ -288,7 +288,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
              1776 66.249.70.27
              2099 54.70.40.11
           
            -
          • I'm thinking about trying to validate our dc.subject terms against AGROVOC webservices
          • +
          • I’m thinking about trying to validate our dc.subject terms against AGROVOC webservices
          • There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for SOIL:
          $ http 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en'
          @@ -336,7 +336,7 @@ X-Frame-Options: ALLOW-FROM http://aims.fao.org
           }
           
          • The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)
          • -
          • I'm a bit confused that there's no obvious return code or status when a term is not found, for example SOILS:
          • +
          • I’m a bit confused that there’s no obvious return code or status when a term is not found, for example SOILS:
          HTTP/1.1 200 OK
           Access-Control-Allow-Origin: *
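          • A crude sketch of what a validation loop could look like (the input file is hypothetical, and treating a response without prefLabel as a miss is an assumption):
          $ while read -r subject; do
              result=$(curl -s -G 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search' --data-urlencode "query=${subject}" --data-urlencode 'lang=en')
              # the API returns HTTP 200 either way, so check the body instead
              echo "${result}" | grep -q prefLabel || echo "NOT FOUND: ${subject}"
            done < /tmp/dc-subjects.txt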
          @@ -428,8 +428,8 @@ In [14]: for row in result.fetchone():
           
          • Tim Donohue responded to my thread about the cookies on the dspace-tech mailing list
              -
            • He suspects it's a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the Tomcat 8.5 migration guide
            • -
            • I tried to switch my XMLUI and JSPUI contexts to use the LegacyCookieProcessor, but it didn't seem to help
            • +
            • He suspects it’s a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the Tomcat 8.5 migration guide
            • +
            • I tried to switch my XMLUI and JSPUI contexts to use the LegacyCookieProcessor, but it didn’t seem to help
            • I filed DS-4140 on the DSpace issue tracker
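            • For reference, the change I tried was roughly this element inside each webapp’s <Context> (a sketch; the exact context.xml location varies):
            <CookieProcessor className="org.apache.tomcat.util.http.LegacyCookieProcessor" />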
          • @@ -438,8 +438,8 @@ In [14]: for row in result.fetchone():
            • Tezira wrote to say she has stopped receiving the DSpace Submission Approved and Archived emails from CGSpace as of January 2nd
                -
              • I told her that I haven't done anything to disable it lately, but that I would check
              • -
              • Bizu also says she hasn't received them lately
              • +
              • I told her that I haven’t done anything to disable it lately, but that I would check
              • +
              • Bizu also says she hasn’t received them lately
            @@ -452,12 +452,12 @@ In [14]: for row in result.fetchone():
          • Day two of CGSpace AReS meeting in Amman
            • Discuss possibly extending the dspace-statistics-api to make community and collection statistics available
            • -
            • Discuss new “final” CG Core document and some changes that we'll need to do on CGSpace and other repositories
            • +
            • Discuss new “final” CG Core document and some changes that we’ll need to do on CGSpace and other repositories
            • We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use “cg” namespace for everything else
            • Major changes are to move dc.contributor.author to dc.creator (which MELSpace and WorldFish are already using in their DSpace repositories)
          • -
          • I am testing the speed of the WorldFish DSpace repository's REST API and it's five to ten times faster than CGSpace as I tested in 2018-10:
          • +
          • I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in 2018-10:
          $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
           
          @@ -582,8 +582,8 @@ In [14]: for row in result.fetchone():
           
           
        • Something happened to the Solr usage statistics on CGSpace
            -
          • I looked on the server and the Solr cores are there (56GB!), and I don't see any obvious errors in dmesg or anything
          • -
          • I see that the server hasn't been rebooted in 26 days so I rebooted it
          • +
          • I looked on the server and the Solr cores are there (56GB!), and I don’t see any obvious errors in dmesg or anything
          • +
          • I see that the server hasn’t been rebooted in 26 days so I rebooted it
        • After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!
        • @@ -712,7 +712,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
        • Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months…
        • For 2019-01 alone the Usage Stats are already around 1.2 million
        • -
        • I tried to look in the nginx logs to see how many raw requests there are so far this month and it's about 1.4 million:
        • +
        • I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:
        # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
         1442874
        @@ -724,8 +724,8 @@ sys     0m2.396s
         
        • Send reminder to Atmire about purchasing the MQM module
        • Trying to decide the solid action points for CGSpace on the CG Core 2.0 metadata…
        • -
        • It's difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!
        • -
        • Also, there is not a good Dublin Core reference (or maybe I just don't understand?)
        • +
        • It’s difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!
        • +
        • Also, there is not a good Dublin Core reference (or maybe I just don’t understand?)
        • Several authoritative documents on Dublin Core appear to be:
          • Dublin Core Metadata Element Set, Version 1.1: Reference Description
          • @@ -762,7 +762,7 @@ sys 0m2.396s

            2019-01-19

            • -

              There's no official set of Dublin Core qualifiers so I can't tell if things like dc.contributor.author that are used by DSpace are official

              +

              There’s no official set of Dublin Core qualifiers so I can’t tell if things like dc.contributor.author that are used by DSpace are official

            • I found a great presentation from 2015 by the Digital Repository of Ireland that discusses using MARC Relator Terms with Dublin Core elements

              @@ -777,12 +777,12 @@ sys 0m2.396s

            2019-01-20

              -
            • That's weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
            • +
            • That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
            # w
              04:46:14 up 213 days,  7:25,  4 users,  load average: 1.94, 1.50, 1.35
             
              -
            • I've definitely rebooted it several times in the past few months… according to journalctl -b it was a few weeks ago on 2019-01-02
            • +
            • I’ve definitely rebooted it several times in the past few months… according to journalctl -b it was a few weeks ago on 2019-01-02
            • I re-ran the Ansible DSpace tag, ran all system updates, and rebooted the host
            • After rebooting I notice that the Linode kernel went down from 4.19.8 to 4.18.16…
            • Atmire sent a quote on our ticket about purchasing the Metadata Quality Module (MQM) for DSpace 5.8
            • @@ -793,7 +793,7 @@ sys 0m2.396s

            2019-01-21

              -
            • Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04's Tomcat 8.5
            • +
            • Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04’s Tomcat 8.5
            • I could either run with a simple tomcat7.service like this:
            [Unit]
            @@ -808,7 +808,7 @@ Group=aorth
             [Install]
             WantedBy=multi-user.target
             
              -
            • Or try to use adapt a real systemd service like Arch Linux's:
            • +
            • Or try to adapt a real systemd service like Arch Linux’s:
            [Unit]
             Description=Tomcat 7 servlet container
            @@ -847,7 +847,7 @@ ExecStop=/usr/bin/jsvc \
             WantedBy=multi-user.target
             
            • I see that jsvc and libcommons-daemon-java are both available on Ubuntu so that should be easy to port
            • -
            • We probably don't need Eclipse Java Bytecode Compiler (ecj)
            • +
            • We probably don’t need Eclipse Java Bytecode Compiler (ecj)
            • I tested Tomcat 7.0.92 on Arch Linux using the tomcat7.service with jsvc and it works… nice!
            • I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number
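            • In shell terms that idea would look something like this (the /opt prefix is an assumption):
            $ wget https://archive.apache.org/dist/tomcat/tomcat-7/v7.0.92/bin/apache-tomcat-7.0.92.tar.gz
            $ sudo tar xf apache-tomcat-7.0.92.tar.gz -C /opt
            $ sudo ln -sfn /opt/apache-tomcat-7.0.92 /opt/tomcat7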
            • I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
            • @@ -858,7 +858,7 @@ $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&a <result name="response" numFound="241" start="0">
        • I opened an issue on the GitHub issue tracker (#10)
        • -
        • I don't think the SolrClient library we are currently using supports these type of queries so we might have to just do raw queries with requests
        • +
        • I don’t think the SolrClient library we are currently using supports these type of queries so we might have to just do raw queries with requests
        • The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):
        import pysolr
        @@ -899,8 +899,8 @@ $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/
         
      • I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a shards query string
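      • As a rough sketch (port and core names taken from the queries above), that means asking the CoreAdmin API which cores are active and then passing them along in a shards parameter:
        $ http 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json&indent=on'
        $ http 'http://localhost:8081/solr/statistics/select?q=*:*&rows=0&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018'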
      • A few things I noticed:
          -
        • Solr doesn't mind if you use an empty shards parameter
        • -
        • Solr doesn't mind if you have an extra comma at the end of the shards parameter
        • +
        • Solr doesn’t mind if you use an empty shards parameter
        • +
        • Solr doesn’t mind if you have an extra comma at the end of the shards parameter
        • If you are searching multiple cores, you need to include the base core in the shards parameter as well
        • For example, compare the following two queries, first including the base core and the shard in the shards parameter, and then only including the shard:
        @@ -930,7 +930,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q= 915 35.237.175.180
        • 35.237.175.180 is known to us
        • -
        • I don't think we've seen 196.191.127.37 before. Its user agent is:
        • +
        • I don’t think we’ve seen 196.191.127.37 before. Its user agent is:
        Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
         
          @@ -957,7 +957,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=

          Very interesting discussion of methods for running Tomcat under systemd

        • -

          We can set the ulimit options that used to be in /etc/default/tomcat7 with systemd's LimitNOFILE and LimitAS (see the systemd.exec man page)

          +

          We can set the ulimit options that used to be in /etc/default/tomcat7 with systemd’s LimitNOFILE and LimitAS (see the systemd.exec man page)

          • Note that we need to use infinity instead of unlimited for the address space
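          • For example, as a drop-in (say /etc/systemd/system/tomcat7.service.d/limits.conf; the values here are illustrative, not tuned):
          [Service]
          LimitNOFILE=16384
          LimitAS=infinity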
          @@ -991,7 +991,7 @@ COPY 1109 9265 45.5.186.2
            I think it’s the usual IPs:

            • 45.5.186.2 is CIAT
            • 70.32.83.92 is CCAFS
            • @@ -1009,7 +1009,7 @@ COPY 1109
            Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s filter-media:

          $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
          @@ -1022,9 +1022,9 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
           

        2019-01-24

        • I noticed Ubuntu’s Ghostscript 9.26 works on some troublesome PDFs where Arch’s Ghostscript 9.26 doesn’t, so the fix for the first/last page crash is not the patch I found yesterday
        • Ubuntu’s Ghostscript uses another patch from Ghostscript git (upstream bug report)
        • I re-compiled Arch’s ghostscript with the patch and then I was able to generate a thumbnail from one of the troublesome PDFs
        • Before and after:
        $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
        @@ -1068,7 +1068,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
         
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
       
      • CIAT’s community currently has 12,000 items in it so this is normal
      • The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again…
      • For example: https://goo.gl/fb/VRj9Gq
      • The full list of MARC Relators on the Library of Congress website linked from the DMCI relators page is very confusing
      • @@ -1085,9 +1085,9 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
        • I tested by doing a Tomcat 7.0.91 installation, then switching it to 7.0.92 and it worked… nice!
        • I refined the tasks so much that I was confident enough to deploy them on DSpace Test and it went very well
        • Basically I just stopped tomcat7, created a dspace user, removed tomcat7, chown’d everything to the dspace user, then ran the playbook
        • So now DSpace Test (linode19) is running Tomcat 7.0.92… w00t
        • Now we need to monitor it for a few weeks to see if there is anything we missed, and then I can change CGSpace (linode18) as well, and we’re ready for Ubuntu 18.04 too!
      @@ -1107,7 +1107,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/ 4644 205.186.128.185 4644 70.32.83.92
      • I think it’s the usual IPs:
        • 70.32.83.92 is CCAFS
        • 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)
        • @@ -1158,7 +1158,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/ 2107 199.47.87.140 2540 45.5.186.2
          • Of course there is CIAT’s 45.5.186.2, but also 45.5.184.2 appears to be CIAT… I wonder why they have two harvesters?
          • 199.47.87.140 and 199.47.87.141 is TurnItIn with the following user agent:
          TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
          @@ -1181,7 +1181,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
           
        • 45.5.186.2 is CIAT as usual…
        • 70.32.83.92 and 205.186.128.185 are CCAFS as usual…
        • 66.249.66.219 is Google…
        • I’m thinking it might finally be time to increase the threshold of the Linode CPU alerts
          • I adjusted the alert threshold from 250% to 275%
          @@ -1233,7 +1233,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/ 9239 45.5.186.2
          • 45.5.186.2 and 45.5.184.2 are CIAT as always
          • 85.25.237.71 is some new server in Germany that I’ve never seen before with the user agent:
          Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
           
          diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html
          index 4610a19b1..ab3e3c04a 100644
          --- a/docs/2019-02/index.html
          +++ b/docs/2019-02/index.html
          @@ -69,7 +69,7 @@ real 0m19.873s user 0m22.203s sys 0m1.979s
          @@ -99,7 +99,7 @@ sys 0m1.979s
          @@ -146,7 +146,7 @@ sys 0m1.979s

          February, 2019

          @@ -179,7 +179,7 @@ real 0m19.873s user 0m22.203s sys 0m1.979s
          • Normally I’d say this was very high, but about this time last year I remember thinking the same thing when we had 3.1 million…
          • I will have to keep an eye on this to see if there is some error in Solr…
          • Atmire sent their pull request to re-enable the Metadata Quality Module (MQM) on our 5_x-dev branch today
              @@ -292,7 +292,7 @@ COPY 321 4658 205.186.128.185 4658 70.32.83.92
              • At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there’s nothing we can do to improve REST API performance!
              • Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?

              2019-02-05

              @@ -461,7 +461,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE 848 66.249.66.219
              • So it seems that the load issue comes from the REST API, not the XMLUI
              • I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don’t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)
              • Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:
              Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
              @@ -470,7 +470,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
               

            IITA Posters and Presentations workflow step 1 empty

            • IITA editors or approvers should be added to that step (though I’m curious why nobody is in that group currently)
            • Abenet says we are not using the “Accept/Reject” step so this group should be deleted
            • Bizuwork asked about the “DSpace Submission Approved and Archived” emails that stopped working last month
            • I tried the test-email command on DSpace and it indeed is not working:
            • @@ -489,7 +489,7 @@ Error sending email: Please see the DSpace documentation for assistance.
              • I can’t connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what’s up
              • CGNET said these servers were discontinued in 2018-01 and that I should use Office 365

              2019-02-08

              @@ -577,18 +577,18 @@ Please see the DSpace documentation for assistance. # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l 95
              • It’s very clear to me now that the API requests are the heaviest!
              • I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it’s becoming a bit of the boy who cried wolf because it alerts like clockwork twice per day!
              • Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository (#408) so I can track changes and distribute them more formally instead of just keeping them collected on the wiki
              • Started adding IITA research theme (cg.identifier.iitatheme) to CGSpace
                • I’m still waiting for feedback from IITA whether they actually want to use “SOCIAL SCIENCE & AGRIC BUSINESS” because it is listed as “Social Science and Agribusiness” on their website
                • Also, I think they want to do some mappings of items with existing subjects to these new themes
              • Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) (#409)
                • I’m still waiting to hear from Bizuwork whether we’ll batch update all existing items with the old name style
                • No, there is only one entry and Bizu already fixed it
              • @@ -606,7 +606,7 @@ Please see the DSpace documentation for assistance.
                Error sending email:
                  - Error: cannot test email because mail.server.disabled is set to true
                 
                • I’m not sure why I didn’t know about this configuration option before, and always maintained multiple configurations for development and production
                @@ -645,11 +645,11 @@ Please see the DSpace documentation for assistance.
                  dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
                   dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
                   
                  • I’d have to think about what to make the REST API endpoints, perhaps: /statistics/top/items?limit=10
                  • But how do I do top items by views / downloads separately?
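                  One way to handle that would be a single endpoint with a parameter selecting the metric; a hypothetical sketch (not the actual dspace-statistics-api code; connection details are placeholders), reusing the two queries above via psycopg2:
                  import psycopg2

                  def top_items(metric="views", limit=10):
                      # whitelist the column name, since identifiers cannot be bound parameters
                      if metric not in ("views", "downloads"):
                          raise ValueError("metric must be 'views' or 'downloads'")
                      conn = psycopg2.connect("dbname=dspacestatistics user=dspacestatistics")
                      with conn.cursor() as cursor:
                          cursor.execute(
                              f"SELECT * FROM items WHERE {metric} > 0 ORDER BY {metric} DESC LIMIT %s",
                              (limit,),
                          )
                          return cursor.fetchall()

                  # e.g. /statistics/top/items?limit=10&metric=downloads would map to top_items("downloads", 10)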
                  • I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly
                    • The quality is JPEG 75 and I don’t see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:
                  @@ -661,7 +661,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads

                2019-02-13

                • ILRI ICT reset the password for the CGSpace mail account, but I still can’t get it to send mail from DSpace’s test-email utility
                • I even added extra mail properties to dspace.cfg as suggested by someone on the dspace-tech mailing list:
                mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
                @@ -671,8 +671,8 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads
                 
                Error sending email:
                  - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
                 
                • I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again
                • After reading the common mistakes in the JavaMail FAQ I reconfigured the extra properties in DSpace’s mail configuration to be simply:
                mail.extraproperties = mail.smtp.starttls.enable=true
                 
                  @@ -707,7 +707,7 @@ $ sudo sysctl kernel.unprivileged_userns_clone=1
                  $ podman pull postgres:9.6-alpine
                  $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
                • Which totally works, but Podman’s rootless support doesn’t work with port mappings yet…
                • Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:
                # systemctl stop tomcat7
                @@ -731,14 +731,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
                 
                # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
                 
                • After restarting Tomcat the usage statistics are back
                • Interestingly, many of the locks were from last month, last year, and even 2015! I’m pretty sure that’s not supposed to be how locks work…
                • Help Sarah Kasyoka finish an item submission that she was having issues with due to the file size
                • I increased the nginx upload limit, but she said she was having problems and couldn’t really tell me why
                • I logged in as her and completed the submission with no problems…

                2019-02-15

                • Tomcat was killed around 3AM by the kernel’s OOM killer according to dmesg:
                [Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
                 [Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
                @@ -748,7 +748,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
                 
              Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
               
              • I suspect it was related to the media-filter cron job that runs at 3AM but I don’t see anything particular in the log files
              • I want to try to normalize the text_lang values to make working with metadata easier
              • We currently have a bunch of weird values that DSpace uses like NULL, en_US, and en and others that have been entered manually by editors:
              @@ -769,19 +769,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
              • The majority are NULL, en_US, the blank string, and en—the rest are not enough to be significant
              • Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!
              • I’m going to normalize these to NULL at least on DSpace Test for now:
              dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
               UPDATE 1045410
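              For reference, the distribution itself is easy to check before (or after) normalizing with a simple GROUP BY:
              dspace=# SELECT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;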
               
              • I started proofing IITA’s 2019-01 records that Sisay uploaded this week
                  • There were 259 records in IITA’s original spreadsheet, but there are 276 in Sisay’s collection
                • Also, I found that there are at least twenty duplicates in these records that we will need to address
              • ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works
              • Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman’s volumes:
              $ podman pull postgres:9.6-alpine
               $ podman volume create dspacedb_data
              @@ -793,7 +793,7 @@ $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h loca
               $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
               $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
               
              • And it’s all running without root!
              • Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:
              $ podman volume create artifactory_data
              @@ -808,7 +808,7 @@ $ podman start artifactory
               

            2019-02-17

            • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
            $ dspace cleanup -v
             Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
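            The detail line naming the offending bitstream is cut off by the next hunk, but the usual workaround for this constraint is to clear the primary bitstream reference and re-run the cleanup, roughly (the ID here is a placeholder, not taken from the log):
            $ psql dspace -c "UPDATE bundle SET primary_bitstream_id = NULL WHERE primary_bitstream_id IN (123456);"
            $ dspace cleanup -v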
            @@ -946,7 +946,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
             

            2019-02-19

            • Linode sent another alert about CPU usage on CGSpace (linode18) averaging 417% this morning
            • Unfortunately, I don’t see any strange activity in the web server API or XMLUI logs at that time in particular
            • So far today the top ten IPs in the XMLUI logs are:
            # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
            @@ -962,9 +962,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
               14686 143.233.242.130
             
            • 143.233.242.130 is in Greece and using the user agent “Indy Library”, like the top IP yesterday (94.71.244.172)
            • That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don’t know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this
            • The user is requesting only things like /handle/10568/56199?show=full so it’s nothing malicious, only annoying
            • Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday’s nginx rate limiting updates
              • I should really try to script something around ipapi.co to get these quickly and easily
              @@ -984,7 +984,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i 12360 2a01:7e00::f03c:91ff:fe0a:d645
            • 2a01:7e00::f03c:91ff:fe0a:d645 is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester…
            • Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I’m so fucking sick of this
            • Our usage stats have exploded the last few months:

            Usage stats

            @@ -1027,12 +1027,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
          Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
           
          • I wrote a quick and dirty Python script called resolve-addresses.py to resolve IP addresses to their owning organization’s name, ASN, and country using the IPAPI.co API
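          A minimal sketch of the kind of lookup the script does (assuming the ipapi.co JSON endpoint; this is not the actual resolve-addresses.py):
          #!/usr/bin/env python3
          import sys

          import requests

          def resolve(ip):
              # ipapi.co returns JSON with org, asn and country_name fields, among others
              r = requests.get(f"https://ipapi.co/{ip}/json/", timeout=10)
              r.raise_for_status()
              data = r.json()
              return data.get("org"), data.get("asn"), data.get("country_name")

          if __name__ == "__main__":
              for ip in sys.argv[1:]:
                  org, asn, country = resolve(ip)
                  print(f"{ip}: {org} ({asn}), {country}")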

          2019-02-20

          • Ben Hack was asking about getting authors publications programmatically from CGSpace for the new ILRI website
          • I told him that they should probably try to use the REST API’s find-by-metadata-field endpoint
          • The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:
          $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
          @@ -1041,7 +1041,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
           
          • This returns six items for me, which is the same I see in a Discovery search
          • Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my dspace-statistics-api
          • I was playing with YasGUI to query AGROVOC’s SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually
          • I think I want to stick to the regular web services to validate AGROVOC terms

          YasGUI querying AGROVOC

          @@ -1064,7 +1064,7 @@ $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subje
        $ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
         
        • And then a list of all the unique unmatched terms using some utility I’ve never heard of before called comm or with diff:
        $ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
         $ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
        @@ -1077,7 +1077,7 @@ COPY 202
         dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
         COPY 33
         
        • I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it’s almost ready so I created a pull request (#413)
        • I still need to test the batch tagging of IITA items with themes based on their IITA subjects:
          • NATURAL RESOURCE MANAGEMENT research theme to items with NATURAL RESOURCE MANAGEMENT subject
          • @@ -1095,13 +1095,13 @@ COPY 33

            Help Udana from WLE with some issues related to CGSpace items on their Publications website

            • He wanted some IWMI items to show up in their publications website
            • The items were mapped into WLE collections, but still weren’t showing up on the publications website
            • I told him that he needs to add the cg.identifier.wletheme to the items so that the website indexer finds them
            • A few days ago he added the metadata to 10568/93011 and now I see that the item is present on the WLE publications website
            Start looking at IITA’s latest round of batch uploads called “IITA_Feb_14” on DSpace Test

            • One mispelled authorship type
            • A few dozen incorrect inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)
            • @@ -1110,7 +1110,7 @@ COPY 33
            • Some whitespace and consistency issues in sponsorships
            • Eight items with invalid ISBN: 0-471-98560-3
            • Two incorrectly formatted ISSNs
            • Lots of incorrect values in subjects, but that’s a difficult problem to do in an automated way
          • @@ -1137,8 +1137,8 @@ return "unmatched"

          2019-02-24

          • I decided to try to validate the AGROVOC subjects in IITA’s recent batch upload by dumping all their terms, checking them in en/es/fr with agrovoc-lookup.py, then reconciling against the final list using reconcile-csv with OpenRefine
          • I’m not sure how to deal with terms like “CORN” that are alternative labels (altLabel) in AGROVOC where the preferred label (prefLabel) would be “MAIZE”
          • For example, a query for CORN* returns:
              "results": [
          @@ -1160,7 +1160,7 @@ return "unmatched"
           
        • I did a duplicate check of the IITA Feb 14 records on DSpace Test and there were about fifteen or twenty items reported
          • A few of them are actually in previous IITA batch updates, which means they have already been uploaded to CGSpace, so I worry that there would be many more
          • I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I’m not sure I can because the Earlham guys are still testing COPO actively on DSpace Test
        @@ -1185,7 +1185,7 @@ return "unmatched" /home/cgspace.cgiar.org/log/solr.log.2019-02-23.xz:0 /home/cgspace.cgiar.org/log/solr.log.2019-02-24:34
        • But I don’t see anything interesting in yesterday’s Solr log…
        • I see this in the Tomcat 7 logs yesterday:
        Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
        @@ -1209,7 +1209,7 @@ Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.lang.Throwable.readObje
         Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         
        • I don’t think that’s related…
        • Also, now the Solr admin UI says “statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”
        • In the Solr log I see:
        @@ -1245,12 +1245,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
      • On a hunch I tried adding ulimit -v unlimited to the Tomcat catalina.sh and now Solr starts up with no core errors and I actually have statistics for January and February on some communities, but not others
      • I wonder if the address space limits that I added via LimitAS=infinity in the systemd service are somehow not working?
      • I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the LimitAS setting does work, and the infinity setting in systemd does get translated to “unlimited” on the service
      • I thought it might be open file limit, but it seems we’re nowhere near the current limit of 16384:
      # lsof -u dspace | wc -l
       3016
       
      • For what it’s worth I see the same errors about solr_update_time_stamp on DSpace Test (linode19)
      • Update DSpace Test to Tomcat 7.0.93
      • Something seems to have happened (some Atmire scheduled task, perhaps the CUA one at 7AM?) on CGSpace because I checked a few communities and collections on CGSpace and there are now statistics for January and February
      @@ -1267,27 +1267,27 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
      • According to the REST API collection 1021 appears to be CCAFS Tools, Maps, Datasets and Models
      • I looked at the WORKFLOW_STEP_1 (Accept/Reject) and the group is of course empty
      • As we’ve seen several times recently, we are not using this step so it should simply be deleted

      2019-02-27

      • Discuss batch uploads with Sisay
      • He’s trying to upload some CTA records, but it’s not possible to do collection mapping when using the web UI
        • I sent a mail to the dspace-tech mailing list to ask about the inability to perform mappings when uploading via the XMLUI batch upload
      • He asked me to upload the files for him via the command line, but the file he referenced (Thumbnails_feb_2019.zip) doesn’t exist
      • I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file’s name:
      $ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
       
      • Why don’t they just derive the directory from the path to the zip file?
      • Working on Udana’s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
        • I also added a few regions because they are obvious for the countries
        • Also I added some rights fields that I noticed were easily available from the publications pages
        • I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn’t find any
        • I uploaded fifty-two records to the Restoring Degraded Landscapes collection on CGSpace
      • @@ -1299,7 +1299,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
        $ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
         
        • Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out sigh
        • Now I’m getting this message when trying to use DSpace’s test-email script:
        $ dspace test-email
         
        @@ -1313,8 +1313,8 @@ Error sending email:
         
         Please see the DSpace documentation for assistance.
         
        • I’ve tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working
        • I sent a mail to ILRI ICT to check if we’re locked out or reset the password again
        diff --git a/docs/2019-03/index.html b/docs/2019-03/index.html
        index 3dde03db9..41e2899cd 100644
        --- a/docs/2019-03/index.html
        +++ b/docs/2019-03/index.html
        @@ -8,9 +8,9 @@
        @@ -73,7 +73,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
        @@ -120,16 +120,16 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca

        March, 2019

        2019-03-01

        • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
        • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
        • Looking at the other half of Udana’s WLE records from 2018-11
          • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
          • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
          • @@ -142,14 +142,14 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca

          2019-03-03

          • Trying to finally upload IITA’s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:
          $ mkdir 2019-03-03-IITA-Feb14
           $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
           
          • As I was inspecting the archive I noticed that there were some problems with the bitstreams:
              • First, Sisay didn’t include the bitstream descriptions
            • Second, only five items had bitstreams and I remember in the discussion with IITA that there should have been nine!
            • I had to refer to the original CSV from January to find the file names, then download and add them to the export contents manually!
            @@ -158,11 +158,11 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
          $ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
           
          • DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something
          • After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the dspace cleanup script
          • Merge the IITA research theme changes from last month to the 5_x-prod branch (#413)
            • I will deploy to CGSpace soon and then think about how to batch tag all IITA’s existing items with this metadata
          • Deploy Tomcat 7.0.93 on CGSpace (linode18) after having tested it on DSpace Test (linode19) for a week
          • @@ -170,7 +170,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14

            2019-03-06

            • Abenet was having problems with a CIP user account, I think that the user could not register
            • I suspect it’s related to the email issue that ICT hasn’t responded about since last week
            • As I thought, I still cannot send emails from CGSpace:
            $ dspace test-email
            @@ -203,17 +203,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
             

          2019-03-08

          • There’s an issue with CGSpace right now where all items are giving a blank page in the XMLUI
            • Interestingly, if I check an item in the REST API it is also mostly blank: only the title and the ID! On second thought I realize I probably was just seeing the default view without any “expands”
            • I don’t see anything unusual in the Tomcat logs, though there are thousands of those solr_update_time_stamp errors:
          # journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
           1076
           
          • I restarted Tomcat and it’s OK now…
          • Skype meeting with Peter and Abenet and Sisay
            • We want to try to crowd source the correction of invalid AGROVOC terms starting with the ~313 invalid ones from our top 1500
            • @@ -244,7 +244,7 @@ UPDATE 44

            2019-03-10

            • Working on tagging IITA’s items with their new research theme (cg.identifier.iitatheme) based on their existing IITA subjects (see notes from 2019-02)
            • I exported the entire IITA community from CGSpace and then used csvcut to extract only the needed fields:
            $ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
            @@ -258,7 +258,7 @@ UPDATE 44
             
          if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
           
          • Then it’s more annoying because there are four IITA subject columns…
          • In total this would add research themes to 1,755 items
          • I want to double check one last time with Bosede that they would like to do this, because I also see that this will tag a few hundred items from the 1970s and 1980s
          @@ -268,7 +268,7 @@ UPDATE 44

        2019-03-12

        • I imported the changes to 256 of IITA’s records on CGSpace

        2019-03-14

          @@ -291,21 +291,21 @@ UPDATE 44 done
          • Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:
          $ grep -oE '201[89]' /tmp/*.csv | sort -u
           /tmp/94834.csv:2018
           /tmp/95615.csv:2018
           /tmp/96747.csv:2018
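          A more structured alternative would have been to stack the per-item CSVs with csvkit (which we already use elsewhere) and grep the combined file once, something like (the column layout of the dumps is assumed):
          $ csvstack --filenames /tmp/*.csv > /tmp/combined.csv
          $ grep -E '201[89]' /tmp/combined.csv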
           
          • And looking at those items more closely, only one of them has an issue date of after 2018-04, so I will only update that one (as the country’s name only changed in 2018-04)
          • Run all system updates and reboot linode20
          • Follow up with Felix from Earlham to see if he’s done testing DSpace Test with COPO so I can re-sync the server from CGSpace

          2019-03-15

          • CGSpace (linode18) has the blank page error again
          • I’m not sure if it’s related, but I see the following error in DSpace’s log:
          2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
           java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
          @@ -354,7 +354,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
                10 dspaceCli
                15 dspaceWeb
           
          • I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today might be related?
          SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
           java.util.EmptyStackException
          @@ -408,7 +408,7 @@ java.util.EmptyStackException
           
        • Last week Felix from Earlham said that they finished testing on DSpace Test (linode19) so I made backups of some things there and re-deployed the system on Ubuntu 18.04
          • During re-deployment I hit a few issues with the Ansible playbooks and made some minor improvements
          • There seems to be an issue with nodejs’s dependencies now, which causes npm to get uninstalled when installing the certbot dependencies (due to a conflict in libssl dependencies)
          • I re-worked the playbooks to use Node.js from the upstream official repository for now
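          For reference, pulling Node.js from upstream on Ubuntu usually means the NodeSource repository, roughly like this (the major version is an example, not necessarily what the playbook pins):
          $ curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash -
          $ sudo apt-get install -y nodejs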
        • @@ -421,13 +421,13 @@ java.util.EmptyStackException
          • After restarting Tomcat, Solr was giving the “Error opening new searcher” error for all cores
          • I stopped Tomcat, added ulimit -v unlimited to the catalina.sh script and deleted all old locks in the DSpace solr directory and then DSpace started up normally
          • I’m still not exactly sure why I see this error and if the ulimit trick actually helps, as the tomcat7.service has LimitAS=infinity anyways (and from checking the PID’s limits file in /proc it seems to be applied)
          • Then I noticed that the item displays were blank… so I checked the database info and saw there were some unfinished migrations
          • I’m not entirely sure if it’s related, but I tried to delete the old migrations and then force running the ignored ones like when we upgraded to DSpace 5.8 in 2018-06 and then after restarting Tomcat I could see the item displays again
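          The rough shape of that procedure, as a sketch (the migration version in the DELETE is a placeholder; the real ones come from dspace database info):
          $ dspace database info
          $ psql dspace -c "DELETE FROM schema_version WHERE version = '5.0.2014.09.26';"
          $ dspace database migrate ignored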
        • I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api
        • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
        $ dspace cleanup -v
         Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
        @@ -485,8 +485,8 @@ $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F
              72 dspace.log.2019-03-17
               8 dspace.log.2019-03-18
         
        • It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don’t match
        • Anyways, the full error in DSpace’s log is:
        2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
         java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
        @@ -509,7 +509,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
         
        2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
         
        • This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6
        • I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?
        • Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:
        $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
        @@ -567,7 +567,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
             717 2019-03-08 11:
              59 2019-03-08 12:
         
        • I’m not sure if it’s cocoon or that’s just a symptom of something else

        2019-03-19

          @@ -581,8 +581,8 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds (1 row)
        • Perhaps my agrovoc-lookup.py script could notify if it finds these because they potentially give false negatives
        • CGSpace (linode18) is having problems with Solr again, I’m seeing “Error opening new searcher” in the Solr logs and there are no stats for previous years
        • Apparently the Solr statistics shards didn’t load properly when we restarted Tomcat yesterday:
        2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
         ...
        @@ -593,7 +593,7 @@ Caused by: org.apache.solr.common.SolrException: Error opening new searcher
                 ... 31 more
         Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
         
        • For reference, I don’t see the ulimit -v unlimited in the catalina.sh script, though the tomcat7 systemd service has LimitAS=infinity
        • The limits of the current Tomcat java process are:
        # cat /proc/27182/limits 
        @@ -615,7 +615,7 @@ Max nice priority         0                    0
         Max realtime priority     0                    0                    
         Max realtime timeout      unlimited            unlimited            us
         
          -
        • I will try to add ulimit -v unlimited to the Catalina startup script and check the output of the limits to see if it's different in practice, as some wisdom on Stack Overflow says this solves the Solr core issues and I've superstitiously tried it various times in the past +
        • I will try to add ulimit -v unlimited to the Catalina startup script and check the output of the limits to see if it’s different in practice, as some wisdom on Stack Overflow says this solves the Solr core issues and I’ve superstitiously tried it various times in the past
          • The result is the same before and after, so adding the ulimit directly is unnecessary (whether unlimited address space is actually useful is another question)
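          • A quick before/after comparison can be done from the shell (a sketch; the paths assume the Ubuntu tomcat7 package, and the Bootstrap class is just how the Tomcat JVM normally shows up in the process list):
          $ grep -n 'ulimit' /usr/share/tomcat7/bin/catalina.sh
          $ cat /proc/$(pgrep -f org.apache.catalina.startup.Bootstrap | head -n1)/limits | grep -i 'address space'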
          # systemctl start tomcat7
          • After restarting I confirmed that all Solr statistics cores were loaded successfully…
          • Another avenue might be to look at point releases in Solr 4.10.x, as we’re running 4.10.2 and they released 4.10.3 and 4.10.4 back in 2014 or 2015
            • I see several issues regarding locks and IndexWriter that were fixed in Solr and Lucene 4.10.3 and 4.10.4…

          2019-03-21

          • It’s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:
          $ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
                 3 2019-03-20 00:
          $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
               440 2019-03-23 08:
               260 2019-03-23 09:
           
          • I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn’t
          • Trying to drill down more, I see that the bulk of the errors started around 21:20:
          $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
          org.postgresql.util.PSQLException: This statement has been closed.
           

          I restarted Tomcat and now the item displays are working again for now

          I am wondering if this is an issue with removing abandoned connections in Tomcat’s JDBC pooling?

          • It’s hard to tell because we have logAbandoned enabled, but I don’t see anything in the tomcat7 service logs in the systemd journal
        • I sent another mail to the dspace-tech mailing list with my observations

          I spent some time trying to test and debug the Tomcat connection pool’s settings, but for some reason our logs are either messed up or no connections are actually getting abandoned

        • I compiled this TomcatJdbcConnectionTest and created a bunch of database connections and waited a few minutes but they never got abandoned until I created over maxActive (75), after which almost all were purged at once
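          • One way to watch for abandoned or idle connections while testing is to poll pg_stat_activity and group by application and state (a sketch; it assumes psql can be run locally as the postgres user):
          $ watch -n 5 "psql -c 'SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count DESC;'"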

          $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
           
          • I need to remember to check the active connections next time we have issues with blank item pages on CGSpace
          • In other news, I’ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing
          • I deployed the latest 5_x-prod branch on CGSpace (linode18) and added more validation to the JDBC pool in our Tomcat config
            • This includes the new testWhileIdle and testOnConnect pool settings as well as the two new JDBC interceptors: StatementFinalizer and ConnectionState that should hopefully make sure our connections in the pool are valid
            • I spent one hour looking at the invalid AGROVOC terms from last week
              • It doesn’t seem like any of the editors did any work on this so I did most of them
          • Looking at the DBCP status on CGSpace via jconsole and everything looks good, though I wonder why timeBetweenEvictionRunsMillis is -1, because the Tomcat 7.0 JDBC docs say the default is 5000…
            • Could be an error in the docs, as I see the Apache Commons DBCP has -1 as the default
            • Maybe I need to re-evaluate the “defaults” of Tomcat 7’s DBCP and set them explicitly in our config
            • From Tomcat 8 they seem to default to Apache Commons’ DBCP 2.x
          • Also, CGSpace doesn’t have many Cocoon errors yet this morning:
          $ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
                 4 2019-03-25 00:
                 1 2019-03-25 01:
           
          • Holy shit I just realized we’ve been using the wrong DBCP pool in Tomcat
            • By default you get the Commons DBCP one unless you specify factory org.apache.tomcat.jdbc.pool.DataSourceFactory
            • Now I see all my interceptor settings etc in jconsole, where I didn’t see them before (also a new tomcat.jdbc mbean)!
            • No wonder our settings didn’t quite match the ones in the Tomcat DBCP Pool docs
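            • To confirm which pool a JNDI Resource is actually using, a grep like this against server.xml should show whether the factory attribute and the tomcat-jdbc-specific settings are present (the path assumes the Ubuntu tomcat7 package):
            $ grep -E 'factory|jdbcInterceptors|testWhileIdle' /etc/tomcat7/server.xml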
          • Uptime Robot reported that CGSpace went down and I see the load is very high
              1222 35.174.184.209
              1720 2a01:4f8:13b:1296::2
            • The IPs look pretty normal except we’ve never seen 93.179.69.74 before, and it uses the following user agent:
            Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
             
              $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
               1
               
            • That’s weird because the total number of sessions today seems low compared to recent days:
              $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
               5657
               
              • I restarted Tomcat and deployed the new Tomcat JDBC settings on CGSpace since I had to restart the server anyways
            • I need to watch this carefully though because I’ve read some places that Tomcat’s DBCP doesn’t track statements and might create memory leaks if an application doesn’t close statements before a connection gets returned back to the pool
              • According to Uptime Robot the server was up and down a few more times over the next hour so I restarted Tomcat again
              • 216.244.66.198 is DotBot
              • 93.179.69.74 is some IP in Ukraine, which I will add to the list of bot IPs in nginx
              • I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)
              • Looking at the database usage I’m wondering why there are so many connections from the DSpace CLI:
              $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
                      5 dspaceApi
                    10 dspaceCli
                    13 dspaceWeb
               
              • Looking closer I see they are all idle… so at least I know the load isn’t coming from some background nightly task or something
              • Make a minor edit to my agrovoc-lookup.py script to match subject terms with parentheses like COCOA (PLANT)
              • Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week
              $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
              • UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
              • Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
              # grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
               2931
               
              • So I’m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
              • Otherwise, these are the top users in the web and API logs the last hour (18–19):
              # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
               
              $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
               937
               
              • I will add their IPs to the list of bot IPs in nginx so that Tomcat’s Crawler Session Manager Valve can force them to re-use their session
              • Another user agent behaving badly in Colombia is “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
              • I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request
              • I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages
                # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
                 119
                 
                • What’s strange is that I can’t see any of their requests in the DSpace log…
                $ grep -I -c 45.5.184.72 dspace.log.2019-03-26 
                 0
                 
                • None of these 18.x.x.x IPs specify a user agent and they are all on Amazon!
                • Shortly after I started the re-indexing UptimeRobot began to complain that CGSpace was down, then up, then down, then up…
                • I see the load on the server is about 10.0 again for some reason though I don’t know WHAT is causing that load
                  • It could be the CPU steal metric, as if Linode has oversold the CPU resources on this VM host…

                  CPU week

                  CPU year

                  • What’s clear from this is that some other VM on our host has heavy usage for about four hours at 6AM and 6PM and that during that time the load on our server spikes
                    • CPU steal has drastically increased since March 25th
                    • It might be time to move to dedicated CPU VM instances, or even real servers
                    • For now I just sent a support ticket to bring this to Linode’s attention
                  • In other news, I see that it’s not even the end of the month yet and we have 3.6 million hits already:
                  # zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
                   3654911
                   
                • It has 64GB of ECC RAM, six core Xeon processor from 2018, and 2x960GB NVMe storage
                • The alternative of staying with Linode and using dedicated CPU instances with added block storage gets expensive quickly if we want to keep more than 16GB of RAM (do we?)
                  • Regarding RAM, our JVM heap is 8GB and we leave the rest of the system’s 32GB of RAM to PostgreSQL and Solr buffers
                  • Seeing as we have 56GB of Solr data it might be better to have more RAM in order to keep more of it in memory
                  • Also, I know that the Linode block storage is a major bottleneck for Solr indexing
              • Looking at the weird issue with shitloads of downloads on the CTA item again
              • The item was added on 2019-03-13 and these three IPs have attempted to download the item’s bitstream 43,000 times since it was added eighteen days ago:
              # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
                    42 196.43.180.134
               
            2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
             
            diff --git a/docs/2019-04/index.html b/docs/2019-04/index.html

            April, 2019


            2019-04-02

            • CTA says the Amazon IPs are AWS gateways for real user traffic
            • I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
              • If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!
              • I ended up finding him via searching for his email address
                $ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
                 
                • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
                • One user’s name has changed so I will update those using my fix-metadata-values.py script:
                $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
                 
                • I created a pull request and merged the changes to the 5_x-prod branch (#417)
                • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:
                2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
                 

                CPU usage week

                • The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…
                • I ran all system updates and rebooted the server
                  • The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:

              2019-04-06

              • Udana asked why item 10568/91278 didn’t have an Altmetric badge on CGSpace, but on the WLE website it does
                • I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
                • I tweeted the item and I assume this will link the Handle with the DOI in the system
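                • One way to compare what Altmetric has for the Handle versus the DOI is the Altmetric details API (a sketch; I believe these lookups work without an API key, and the DOI variant uses /doi/ instead of /handle/):
                $ curl -s 'https://api.altmetric.com/v1/handle/10568/91278'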
                  4267 45.5.186.2
                  4893 205.186.128.185
                  • 45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:
                  GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
                   
                  • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
                  • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):
                  # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
                     22077 /handle/10568/72970/discover
                       }
                   }
                   
                  • Strangely I don’t see many hits in 2019-04:
                  $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
                   {
                   
                  • So definitely the size of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
                    • After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
                  • So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
                    • Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:
                  • Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are statistics_type:view… very weird
                    • I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
                    • I will try to re-deploy the 5_x-dev branch and test again
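                    • A facet query along these lines (assuming the same core, port, and field names as the queries above) would break down those 2019-04 hits by statistics_type:
                    $ http --print b 'http://localhost:8081/solr/statistics/select?q=dateYearMonth%3A2019-04&facet=true&facet.field=statistics_type&rows=0&wt=json&indent=true'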
                    • I confirmed the same on CGSpace itself after making one HEAD request
                    • So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
                      • I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
                    • See: DS-3986
                    • See: DS-4020
                    • See: DS-3832
                    • DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been removed from MaxMind’s server as of 2018-04-01
                    • Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this
                  • UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check iostat 1 10 and I saw that CPU steal is around 10–30 percent right now…
                  • The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:
                  $ cat /proc/loadavg 
                   10.70 9.17 8.85 18/633 4198
                   

                2019-04-08

                • Start checking IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
                  • Lots of problems with affiliations, I had to correct about sixty of them
                  • I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:
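                  • The exact invocation isn’t captured here, but a reconcile-csv endpoint is normally started along these lines (the CSV path and column names below are placeholders for the real affiliation file):
                  $ lein run /tmp/affiliations.csv name id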
                    • After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
                      • The matched values can be accessed with cell.recon.match.name, but some of the new values don’t appear, perhaps because I edited the original cell values?
                      • I ended up using this GREL expression to copy all values to a new column:

                    CPU usage week

                    • Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
                    • The web server logs are not very busy:
                    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                     
                    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
                     
                    • Otherwise, they provide the funder data in CSV and RDF format
                    • I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…
                    • If I want to write a script for this I could use the Python habanero library:
                    from habanero import Crossref
                    cr = Crossref(mailto="me@cgiar.org")
                     x = cr.funders(query = "mercator")
                     

                    2019-04-11

                    • Continue proofing IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
                      • One misspelled country
                      • Three incorrect regions
                    • I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:
                    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
                     $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
                     
                  • Answer more questions about DOIs and Altmetric scores from WLE
                  • Answer more questions about DOIs and Altmetric scores from IWMI
                    • They can’t seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
                    • To make things worse, many of their items DON’T have DOIs, so when Altmetric harvests them of course there is no link! Then, a bunch of their items don’t have scores because they never tweeted them!
                    • They added a DOI to this old item 10567/97087 this morning and wonder why Altmetric’s score hasn’t linked with the DOI magically
                    • We should check in a week, when Altmetric harvests again, to see if it makes the association
                    • It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear “sawtooth” pattern in the garbage collection
                    • I am curious if the GC pattern would be different if I switched from the -XX:+UseConcMarkSweepGC to G1GC
                    • I switched to G1GC and restarted Tomcat but for some reason I couldn’t see the Tomcat PID in VisualVM…
                      • Anyways, the indexing process took much longer, perhaps twice as long!
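                      • For reference, switching collectors is just a matter of swapping the GC flag in the Tomcat JAVA_OPTS (a sketch, using the 8GB heap mentioned earlier; our other options are omitted here):
                      JAVA_OPTS="-Xms8192m -Xmx8192m -XX:+UseG1GC"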
                    • Tag version 1.0.0 and deploy it on DSpace Test
                  • Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to iostat 1 10, though I haven’t had any Linode alerts in a few days
                  • Abenet sent me a list of ILRI items that don’t have CRPs added to them
                    • The spreadsheet only had Handles (no IDs), so I’m experimenting with using Python in OpenRefine to get the IDs
                    • I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:
                    item_id = data['id']
                    return item_id
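                    • The same lookup can also be done from the shell with curl and jq against the DSpace 5 REST API (the handle below is just a placeholder):
                    $ curl -s -H "Accept: application/json" 'https://cgspace.cgiar.org/rest/handle/10568/12345' | jq '.id'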
                    • Luckily none of the items already had CRPs, so I didn’t have to worry about them getting removed
                      • It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs…
                    • I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:
                    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
                     
                    user    7m33.446s
                     sys     2m13.463s
                     

                    2019-04-16

                    • Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something

                    2019-04-17

                    • The biggest takeaway I have is that this workload benefits from a larger filterCache (for Solr fq parameter), but barely uses the queryResultCache (for Solr q parameter) at all
                      • The number of hits goes up and the time taken decreases when we increase the filterCache, and total JVM heap memory doesn’t seem to increase much at all
                      • I guess the queryResultCache size is always 2 because I’m only doing two queries: type:0 and type:2 (downloads and views, respectively)
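                      • The cache behaviour can also be checked while the indexer runs via Solr’s mbeans stats endpoint (a sketch, assuming the same port and core name as above); the filterCache and queryResultCache entries report size, hits, and hitratio:
                      $ curl -s 'http://localhost:8081/solr/statistics/admin/mbeans?stats=true&cat=CACHE&wt=json&indent=true'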
                    • Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:

                      CPU usage week

                      2019-04-18

                      • I’ve been trying to copy the statistics-2018 Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
                        • I opened a support ticket to ask Linode to investigate
                        • They asked me to send an mtr report from Fremont to Frankfurt and vice versa
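                        • For reference, an mtr report can be generated non-interactively along these lines (the hostname is just an example; run it in each direction):
                        $ mtr --report --report-cycles 20 dspacetest.cgiar.org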
                        • Deploy Tomcat 7.0.94 on DSpace Test (linode19)
                          • Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of something with how Ansible formatted the options string
                          • I needed to use the “folded” YAML variable format >- (with the dash so it doesn’t add a return at the end)
                        • UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with iostat 1 10 and it’s in the 50s and 60s
                          • The munin graph shows a lot of CPU steal (red) currently (and over all during the week):
                    TCP window size: 85.0 KByte (default)
                    [  5]  0.0-10.2 sec   172 MBytes   142 Mbits/sec
                    [  4]  0.0-10.5 sec   202 MBytes   162 Mbits/sec
                    • Even with the software firewalls disabled the rsync speed was low, so it’s not a rate limiting issue
                    • I also tried to download a file over HTTPS from CGSpace to DSpace Test, but it was capped at 20KiB/sec
                      • I updated the Linode issue with this information
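                      • A simple way to measure that kind of cap with curl (the URL is a placeholder for any reasonably large bitstream):
                      $ curl -s -o /dev/null -w 'average %{speed_download} bytes/sec\n' 'https://cgspace.cgiar.org/bitstream/handle/10568/12345/example.pdf'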
                    • I’m going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode’s latest x86_64
                      • Nope, still 20KiB/sec
                    • Deploy Solr 4.10.4 on CGSpace (linode18)
                    • Deploy Tomcat 7.0.94 on CGSpace
                    • Deploy dspace-statistics-api v1.0.0 on CGSpace
                    • Linode support replicated the results I had from the network speed testing and said they don’t know why it’s so slow
                      • They offered to live migrate the instance to another host to see if that helps

                    2019-04-22

                    • Abenet pointed out an item that doesn’t have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
                      • I tweeted the Handle to see if it will pick it up…
                      • Like clockwork, after fifteen minutes there was a donut showing on CGSpace
                      • Perhaps that’s why the Azure pricing is so expensive!
                      • Add a privacy page to CGSpace
                        • The work was mostly similar to the About page at /page/about, but in addition to adding i18n strings etc, I had to add the logic for the trail to dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl
                        • While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (dc.identifier.uri)
                          • According to my notes in 2018-09 I had noticed this when he uploaded the records and told him to remove them, but he didn’t…
                          • I exported the IITA community as a CSV then used csvcut to extract the two URI columns and identify and fix the records:
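                          • The exact command isn’t shown here, but with csvkit it is something along these lines (the file and column names are placeholders; the URI column names in a DSpace metadata export depend on how the field was split):
                          $ csvcut -c 'id,dc.identifier.uri,dc.identifier.uri[]' /tmp/iita.csv > /tmp/iita-uris.csv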
                          • I told him we never finished it, and that he should try to use the /items/find-by-metadata-field endpoint, with the caveat that you need to match the language attribute exactly (ie “en”, “en_US”, null, etc)
                          • I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)
                        • He says he’s getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:
                        $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
                         curl: (22) The requested URL returned error: 401
                         
                      • Note that curl only shows the HTTP 401 error if you use -f (fail), and only then if you don’t include -s
                          • I see there are about 1,000 items using CPWF subject “WATER MANAGEMENT” in the database, so there should definitely be results
                          • The breakdown of text_lang fields used in those items is 942:
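                        • That breakdown can be reproduced with a GROUP BY over text_lang (a sketch; the metadata_field_id used here is a placeholder and should be looked up in the metadata field registry for cg.subject.cpwf first):
                        $ psql -d dspace -c "SELECT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' GROUP BY text_lang ORDER BY count DESC;"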
                          • I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:
                            2019-04-24 08:11:51,129 INFO  org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
                             2019-04-24 08:11:51,231 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
                             COPY 65752
                             

                            2019-04-28

                            • Still trying to figure out the issue with the items that cause the REST API’s /items/find-by-metadata-value endpoint to throw an exception
                              • I made the item private in the UI and then I see in the UI and PostgreSQL that it is no longer discoverable:
                            $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
                             
                            • Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I’ll try to do a CSV
                              • In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:
                              diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html

                              May, 2019

                              • I managed to delete the problematic item from the database
                                • First I deleted the item’s bitstream in XMLUI and then ran dspace cleanup -v to remove it from the assetstore
                                • Then I ran the following SQL:
                                dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
                                dspace=# DELETE FROM item WHERE item_id=74648;
                                • Now the item is (hopefully) really gone and I can continue to troubleshoot the issue with REST API’s /items/find-by-metadata-value endpoint
                                  • Of course I run into another HTTP 401 error when I continue trying the LandPortal search from last month:
                                • Some are in the workspaceitem table (pre-submission), others are in the workflowitem table (submitted), and others are actually approved, but withdrawn…
                                  • This is actually a worthless exercise because the real issue is that the /items/find-by-metadata-value endpoint is simply flawed by design and shouldn’t be fatally erroring when the search returns items the user doesn’t have permission to access
                                  • It would take way too much time to try to fix the fucked up items that are in limbo by deleting them in SQL, but also, it doesn’t actually fix the problem because some items are submitted but withdrawn, so they actually have handles and everything
                                  • I think the solution is to recommend people don’t use the /items/find-by-metadata-value endpoint
                                • CIP is asking about embedding PDF thumbnail images in their RSS feeds again
                                  • They asked in 2018-09 as well and I told them it wasn’t possible
                                  • To make sure, I looked at the documentation for RSS media feeds and tried it, but couldn’t get it to work
                                  • It seems to be geared towards iTunes and Podcasts… I dunno

                                  linode18 postgres connections day

                                  linode18 CPU day

                                  • The number of unique sessions today is ridiculously high compared to the last few days considering it’s only 12:30PM right now:
                                  $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
                                   101108
                                      2845 HEAD
                                     98121 GET
                                   
                                  • I’m not exactly sure what happened this morning, but it looks like some legitimate user traffic; perhaps someone launched a new publication and it got a bunch of hits?
                                  • Looking again, I see 84,000 requests to /handle this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in access.log):
                                  # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
                                  Error sending email:
                                   Please see the DSpace documentation for assistance.
                                   
                                  • I checked the settings and apparently I had updated it incorrectly last week after ICT reset the password
                                  • Help Moayad with certbot-auto for Let’s Encrypt scripts on the new AReS server (linode20)
                                  • Normalize all text_lang values for metadata on CGSpace and DSpace Test (as I had tested last month):
                                  UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
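                                  • Afterwards the remaining values can be checked with a query like this (a sketch, using the same metadatavalue schema as the UPDATE above):
                                  $ psql -d dspace -c "SELECT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;"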
                                  @@ -455,7 +455,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
                                   
                                   
                                  • So this was definitely an attack of some sort… only God knows why
                                  • -
• I noticed a few new bots that don't use the word “bot” in their user agent and therefore don't match Tomcat's Crawler Session Manager Valve:
• +
• I noticed a few new bots that don’t use the word “bot” in their user agent and therefore don’t match Tomcat’s Crawler Session Manager Valve:
                                    • Blackboard Safeassign
                                    • Unpaywall
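A rough way to gauge how much traffic such agents generate (a sketch; the substrings are guesses at the user agent strings rather than exact values) is to grep the nginx logs for them case-insensitively:

```
# zcat --force /var/log/nginx/access.log* | grep -c -i -E 'safeassign|unpaywall'
```

If the counts are significant it may be worth treating those agents as bots as well.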
                                    • @@ -486,7 +486,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata

                                    2019-05-15

                                      -
                                    • Tezira says she's having issues with email reports for approved submissions, but I received an email about collection subscriptions this morning, and I tested with dspace test-email and it's also working…
                                    • +
                                    • Tezira says she’s having issues with email reports for approved submissions, but I received an email about collection subscriptions this morning, and I tested with dspace test-email and it’s also working…
                                    • Send a list of DSpace build tips to Panagis from AgroKnow
                                    • Finally fix the AReS v2 to work via DSpace Test and send it to Peter et al to give their feedback
                                        @@ -501,7 +501,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
                                        dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
                                         COPY 995
                                         
                                  • I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically
                                  • Instead, I exported a new list and asked Peter to look at it again
                                  • -
                                  • Apply Peter's new corrections on DSpace Test and CGSpace:
                                  • +
                                  • Apply Peter’s new corrections on DSpace Test and CGSpace:
                                  $ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
                                   $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
                                  @@ -581,7 +581,7 @@ COPY 64871
                                   
                                • Run all system updates on DSpace Test (linode19) and reboot it
                                • Paola from CIAT asked for a way to generate a report of the top keywords for each year of their articles and journals
                                    -
                                  • I told them that the best way (even though it's low tech) is to work on a CSV dump of the collection
                                  • +
                                  • I told them that the best way (even though it’s low tech) is to work on a CSV dump of the collection
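For what it's worth, a low-tech pass over such a CSV dump might look like this (a sketch with assumed column names, not something from the original notes): filter the rows to one year, split the multi-value subject field, and count the terms:

```
$ csvgrep -c 'dc.date.issued[en_US]' -r '^2018' collection.csv | csvcut -c 'dc.subject[en_US]' | sed -e '1d' -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -rn | head -n 20
```

Repeating the csvgrep filter for each year gives the per-year keyword report without any custom code.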
                                @@ -600,7 +600,7 @@ COPY 64871
                              2019-05-30 07:19:35,166 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
                               
                                -
                              • For now I just created an eperson with her personal email address until I have time to check LDAP to see what's up with her CGIAR account:
                              • +
                              • For now I just created an eperson with her personal email address until I have time to check LDAP to see what’s up with her CGIAR account:
                              $ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
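When there is time to dig into LDAP, a generic query along these lines could confirm whether the account exists and is enabled (purely a sketch: the server, bind DN, and search base are placeholders, not values from these notes):

```
$ ldapsearch -x -H ldaps://ldap.example.org -D 'cn=binduser,dc=example,dc=org' -W -b 'dc=example,dc=org' '(sAMAccountName=sa.saini)' mail displayName
```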
                               
diff --git a/docs/2019-06/index.html b/docs/2019-06/index.html
index 1c43af46c..a49fde881 100644
--- a/docs/2019-06/index.html
+++ b/docs/2019-06/index.html
@@ -31,7 +31,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
-
+
@@ -61,7 +61,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
-
+
@@ -108,7 +108,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2

                              June, 2019

                              @@ -172,16 +172,16 @@ Skype with Marie-Angélique and Abenet about CG Core v2
                            • Create a new AReS repository: https://github.com/ilri/AReS
                            • Start looking at the 203 IITA records on DSpace Test from last month (IITA_May_16 aka “20194th.xls”) using OpenRefine
                                -
                              • Trim leading, trailing, and consecutive whitespace on all columns, but I didn't notice very many issues
                              • +
                              • Trim leading, trailing, and consecutive whitespace on all columns, but I didn’t notice very many issues
                              • Validate affiliations against latest list of top 1500 terms using reconcile-csv, correcting and standardizing about twenty-seven
                              • Validate countries against latest list of countries using reconcile-csv, correcting three
                              • -
                              • Convert all DOIs to “https://dx.doi.org" format
                              • +
                              • Convert all DOIs to “https://dx.doi.org” format
                              • Normalize all cg.identifier.url Google book fields to “books.google.com”
                              • Correct some inconsistencies in IITA subjects
                              • Correct two incorrect “Peer Review” in dc.description.version
                              • About fifteen items have incorrect ISBNs (looks like an Excel error because the values look like scientific numbers)
                              • Delete one blank item
                              • -
                              • I managed to get to subjects, so I'll continue from there when I start working next
                              • +
                              • I managed to get to subjects, so I’ll continue from there when I start working next
• Generate a new list of countries from the database for use with reconcile-csv
@@ -194,7 +194,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
COPY 192
$ csvcut -l -c 0 /tmp/countries.csv > 2019-06-10-countries.csv
                                -
                              • Get a list of all the unique AGROVOC subject terms in IITA's data and export it to a text file so I can validate them with my agrovoc-lookup.py script:
                              • +
                              • Get a list of all the unique AGROVOC subject terms in IITA’s data and export it to a text file so I can validate them with my agrovoc-lookup.py script:
                              $ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
                               $ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
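A trivial follow-up check (sketch) is to count how many terms matched versus were rejected, to get a sense of how much manual cleanup remains:

```
$ wc -l iita-agrovoc-matches.txt iita-agrovoc-rejects.txt
```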
                              @@ -251,9 +251,9 @@ UPDATE 2
                               
                            • Lots of variation in affiliations, for example:
                              • Université Abomey-Calavi
                              • -
                              • Université d'Abomey
                              • -
                              • Université d'Abomey Calavi
                              • -
                              • Université d'Abomey-Calavi
                              • +
                              • Université d’Abomey
                              • +
                              • Université d’Abomey Calavi
                              • +
                              • Université d’Abomey-Calavi
                              • University of Abomey-Calavi
diff --git a/docs/2019-07/index.html b/docs/2019-07/index.html
index 07ab039ef..e6f68bb48 100644
--- a/docs/2019-07/index.html
+++ b/docs/2019-07/index.html
@@ -35,7 +35,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
-
+
@@ -65,7 +65,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
-
+
@@ -112,7 +112,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo

                              July, 2019

                              @@ -129,16 +129,16 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
                            • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
                              -
• If I change the parameters to 2019 I see stats, so I'm really thinking it has something to do with the sharded yearly Solr statistics cores
• +
                            • If I change the parameters to 2019 I see stats, so I’m really thinking it has something to do with the sharded yearly Solr statistics cores
                                -
                              • I checked the Solr admin UI and I see all Solr cores loaded, so I don't know what it could be
                              • +
                              • I checked the Solr admin UI and I see all Solr cores loaded, so I don’t know what it could be
• When I check the Atmire content and usage module it seems obvious that there is a problem with the old cores because I don’t have anything before 2019-01

                            Atmire CUA 2018 stats missing

                              -
                            • I don't see anyone logged in right now so I'm going to try to restart Tomcat and see if the stats are accessible after Solr comes back up
                            • +
                            • I don’t see anyone logged in right now so I’m going to try to restart Tomcat and see if the stats are accessible after Solr comes back up
                            • I decided to run all system updates on the server (linode18) and reboot it
• After rebooting, Tomcat came back up, but the Solr statistics cores were not all loaded
• @@ -166,24 +166,24 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
# find /dspace/solr/statistics* -iname "*.lock" -print -delete
# systemctl start tomcat7
                                  -
                                • But it still didn't work!
                                • +
                                • But it still didn’t work!
                                • I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:
                                <lockType>${solr.lock.type:simple}</lockType>
                                 
                                  -
                                • And after restarting Tomcat it still doesn't work
                                • -
                                • Now I'll try going back to “native” locking with unlockAtStartup:
                                • +
                                • And after restarting Tomcat it still doesn’t work
                                • +
                                • Now I’ll try going back to “native” locking with unlockAtStartup:
                                <unlockOnStartup>true</unlockOnStartup>
                                 
                                  -
                                • Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can't access any stats before 2018
                                • -
                                • I filed an issue with Atmire, so let's see if they can help
                                • -
                                • And since I'm annoyed and it's been a few months, I'm going to move the JVM heap settings that I've been testing on DSpace Test to CGSpace
                                • +
                                • Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018
                                • +
                                • I filed an issue with Atmire, so let’s see if they can help
                                • +
                                • And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace
                                • The old ones were:
                                -Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
                                 
                                  -
                                • And the new ones come from Solr 4.10.x's startup scripts:
                                • +
                                • And the new ones come from Solr 4.10.x’s startup scripts:
                                    -Djava.awt.headless=true
                                     -Xms8192m -Xmx8192m
                                @@ -253,7 +253,7 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
                                 "Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
                                 "Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
                                 
                                  -
                                • But when I ran fix-metadata-values.py I didn't see any changes:
                                • +
                                • But when I ran fix-metadata-values.py I didn’t see any changes:
                                $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
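One way to debug a run that reports no changes (a hedged sketch reusing the metadata field ID from the command above) is to confirm that the values being corrected actually exist in the database:

```
$ psql -h localhost -U dspace dspace -c "SELECT text_value, COUNT(*) FROM metadatavalue WHERE metadata_field_id=240 AND text_value LIKE 'Mwungu:%' GROUP BY text_value;"
```

If the query returns no rows then the corrections file simply no longer matches what is stored.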
                                 

                                2019-07-06

                                @@ -328,7 +328,7 @@ dc.identifier.issn
                              2019-07-10 11:50:27,433 INFO  org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
                               
                                -
                              • I'm assuming something happened in his browser (like a refresh) after the item was submitted…
                              • +
                              • I’m assuming something happened in his browser (like a refresh) after the item was submitted…

                              2019-07-12

                                @@ -336,7 +336,7 @@ dc.identifier.issn
                                • Unfortunately there is no concrete feedback yet
                                • I think we need to upgrade our DSpace Test server so we can fit all the Solr cores…
                                • -
                                • Actually, I looked and there were over 40 GB free on DSpace Test so I copied the Solr statistics cores for the years 2017 to 2010 from CGSpace to DSpace Test because they weren't actually very large
                                • +
                                • Actually, I looked and there were over 40 GB free on DSpace Test so I copied the Solr statistics cores for the years 2017 to 2010 from CGSpace to DSpace Test because they weren’t actually very large
                                • I re-deployed DSpace for good measure, and I think all Solr cores are loading… I will do more tests later
@@ -353,7 +353,7 @@ $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bits
UPDATE 1

                                2019-07-16

                                  -
                                • Completely reset the Podman configuration on my laptop because there were some layers that I couldn't delete and it had been some time since I did a cleanup:
                                • +
                                • Completely reset the Podman configuration on my laptop because there were some layers that I couldn’t delete and it had been some time since I did a cleanup:
                                $ podman system prune -a -f --volumes
                                 $ sudo rm -rf ~/.local/share/containers
                                @@ -376,7 +376,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
                                 
                                • Talk to Moayad about the remaining issues for OpenRXV / AReS
                                    -
                                  • He sent a pull request with some changes for the bar chart and documentation about configuration, and said he'd finish the export feature next week
                                  • +
                                  • He sent a pull request with some changes for the bar chart and documentation about configuration, and said he’d finish the export feature next week
                                • Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:
                                • @@ -399,13 +399,13 @@ Please see the DSpace documentation for assistance.
                                  • ICT reset the password for the CGSpace support account and apparently removed the expiry requirement
                                      -
                                    • I tested the account and it's working
                                    • +
                                    • I tested the account and it’s working

                                  2019-07-20

                                    -
                                  • Create an account for Lionelle Samnick on CGSpace because the registration isn't working for some reason:
                                  • +
                                  • Create an account for Lionelle Samnick on CGSpace because the registration isn’t working for some reason:
                                  $ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
                                   
                                    @@ -413,12 +413,12 @@ Please see the DSpace documentation for assistance.
                                  • Start looking at 1429 records for the Bioversity batch import
• Multiple authors should be specified with multi-value separator (||) instead of ;
                                    • -
                                    • We don't use “(eds)” as an author
                                    • +
                                    • We don’t use “(eds)” as an author
                                    • Same issue with dc.publisher using “;” for multiple values
                                    • Some invalid ISSNs in dc.identifier.issn (they look like ISBNs)
                                    • I see some ISSNs in the dc.identifier.isbn field
                                    • I see some invalid ISBNs that look like Excel errors (9,78E+12)
                                    • -
                                    • For DOI we just use the URL, not “DOI: https://doi.org..."
                                    • +
                                    • For DOI we just use the URL, not “DOI: https://doi.org…”
                                    • I see an invalid “LEAVE BLANK” in the cg.contributor.crp field
                                    • Country field is using “,” for multiple values instead of “||”
                                    • Region field is using “,” for multiple values instead of “||”
                                    • @@ -462,7 +462,7 @@ Please see the DSpace documentation for assistance.
                                    • A few strange publishers after splitting multi-value cells, like “(Belgium)”
                                    • Deleted four ISSNs that are actually ISBNs and are already present in the ISBN field
                                    • Eight invalid ISBNs
                                    • -
                                    • Convert all DOIs to “https://doi.org" format and fix one invalid DOI
                                    • +
                                    • Convert all DOIs to “https://doi.org” format and fix one invalid DOI
                                    • Fix a handful of incorrect CRPs that seem to have been split on comma “,”
                                    • Lots of strange values in cg.link.reference, and I normalized all DOIs to https://doi.org format
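Several of these issues are mechanical and could be normalized in bulk before re-checking the file; for example (a sketch against a hypothetical export file, not the cleanup that was actually run):

```
$ sed -i -e 's/; */||/g' -e 's|DOI: *https://doi.org|https://doi.org|g' /tmp/bioversity-export.csv
```

Since a blanket replacement of “;” can also touch legitimate punctuation inside titles and citations, the result is worth reviewing before import.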
diff --git a/docs/2019-08/index.html b/docs/2019-08/index.html
index a8767f4ab..025a9e025 100644
--- a/docs/2019-08/index.html
+++ b/docs/2019-08/index.html
@@ -8,7 +8,7 @@
-
+
@@ -73,7 +73,7 @@ Run system updates on DSpace Test (linode19) and reboot it
-
+
@@ -120,14 +120,14 @@ Run system updates on DSpace Test (linode19) and reboot it

                                        August, 2019

                                        2019-08-03

                                          -
                                        • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
                                        • +
                                        • Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

                                        2019-08-04

                                          @@ -135,7 +135,7 @@ Run system updates on DSpace Test (linode19) and reboot it
                                        • Run system updates on CGSpace (linode18) and reboot it
                                          • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
                                          • -
                                          • After rebooting, all statistics cores were loaded… wow, that's lucky.
                                          • +
                                          • After rebooting, all statistics cores were loaded… wow, that’s lucky.
                                        • Run system updates on DSpace Test (linode19) and reboot it
                                        • @@ -199,7 +199,7 @@ Run system updates on DSpace Test (linode19) and reboot it isNotNull(value.match(/^.*û.*$/)) ).toString()
                                  -
                                • I tried to extract the filenames and construct a URL to download the PDFs with my generate-thumbnails.py script, but there seem to be several paths for PDFs so I can't guess it properly
                                • +
                                • I tried to extract the filenames and construct a URL to download the PDFs with my generate-thumbnails.py script, but there seem to be several paths for PDFs so I can’t guess it properly
                                • I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test

                                2019-08-06

                                @@ -231,7 +231,7 @@ Run system updates on DSpace Test (linode19) and reboot it
                                # /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
                                 
                                • It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains
                                • -
                                • Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04's OpenSSL 1.1.0g with nginx 1.16.0
                                • +
                                • Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04’s OpenSSL 1.1.0g with nginx 1.16.0
                                • Run all system updates on AReS dev server (linode20) and reboot it
                                • Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:
                                @@ -253,7 +253,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
- Even so, there are still 52 items with incorrect filenames, so I can't derive their PDF URLs…
+ Even so, there are still 52 items with incorrect filenames, so I can’t derive their PDF URLs…

                              @@ -348,7 +348,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
                              • I imported the 1,427 Bioversity records into DSpace Test
                                  -
• To make sure we didn't have memory issues I reduced Tomcat's JVM heap by 512m, increased the import process's heap to 512m, and split the input file into two parts with about 700 each
                                • +
• To make sure we didn’t have memory issues I reduced Tomcat’s JVM heap by 512m, increased the import process’s heap to 512m, and split the input file into two parts with about 700 each
                                • Then I had to create a few new temporary collections on DSpace Test that had been created on CGSpace after our last sync
                                • After that the import succeeded:
                                @@ -395,8 +395,8 @@ return os.path.basename(value)

                              2019-08-21

                                -
                              • Upload csv-metadata-quality repository to ILRI's GitHub organization
                              • -
• Fix a few invalid countries in IITA's July 29 records (aka “20195TH.xls”)
• +
                              • Upload csv-metadata-quality repository to ILRI’s GitHub organization
                              • +
                              • Fix a few invalid countries in IITA’s July 29 records (aka “20195TH.xls”)
                                • These were not caught by my csv-metadata-quality check script because of a logic error
                                • Remove dc.identified.uri fields from test data, set id values to “-1”, add collection mappings according to dc.type, and Upload 126 IITA records to CGSpace
                                • @@ -439,13 +439,13 @@ sys 2m24.715s
                                • Peter asked me to add related citation aka cg.link.citation to the item view

                                    -
                                  • I created a pull request with a draft implementation and asked for Peter's feedback
                                  • +
                                  • I created a pull request with a draft implementation and asked for Peter’s feedback
                                • Add the ability to skip certain fields from the csv-metadata-quality script using --exclude-fields

                                    -
                                  • For example, when I'm working on the author corrections I want to do the basic checks on the corrected fields, but on the original fields so I would use --exclude-fields dc.contributor.author for example
                                  • +
                                  • For example, when I’m working on the author corrections I want to do the basic checks on the corrected fields, but on the original fields so I would use --exclude-fields dc.contributor.author for example
                                @@ -493,7 +493,7 @@ COPY 65597
                                • Resume working on the CG Core v2 changes in the 5_x-cgcorev2 branch again
                                    -
                                  • I notice that CG Core doesn't currently have a field for CGSpace's “alternative title” (dc.title.alternative), but DCTERMS has dcterms.alternative so I raised an issue about adding it
                                  • +
                                  • I notice that CG Core doesn’t currently have a field for CGSpace’s “alternative title” (dc.title.alternative), but DCTERMS has dcterms.alternative so I raised an issue about adding it
                                  • Marie responded and said she would add dcterms.alternative
                                  • I created a sed script file to perform some replacements of metadata on the XMLUI XSL files:
                                  @@ -521,7 +521,7 @@ COPY 65597
                                "handles":["10986/30568","10568/97825"],"handle":"10986/30568"
                                 
                                  -
• So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn't show it because it seems to be a secondary handle or something
                                • +
• So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn’t show it because it seems to be a secondary handle or something

                                2019-08-31

diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html
index 18871b4a0..16fce6aab 100644
--- a/docs/2019-09/index.html
+++ b/docs/2019-09/index.html
@@ -69,7 +69,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
 7249 2a01:7e00::f03c:91ff:fe18:7396
 9124 45.5.186.2
"/>
-
+
@@ -99,7 +99,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
-
+
@@ -146,7 +146,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

                                  September, 2019

                                  @@ -197,7 +197,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning: 2350 discover 71 handle
                                    -
                                  • I'm not sure why the outbound traffic rate was so high…
                                  • +
                                  • I’m not sure why the outbound traffic rate was so high…

                                  2019-09-02

                                    @@ -304,7 +304,7 @@ dspace.log.2019-09-15:808 2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore" 2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
                                      -
                                    • I restarted Tomcat and the item views came back, but then the Solr statistics cores didn't all load properly +
• I restarted Tomcat and the item views came back, but then the Solr statistics cores didn't all load properly
• +
                                      • After restarting Tomcat once again, both the item views and the Solr statistics cores all came back OK
                                      @@ -312,7 +312,7 @@ dspace.log.2019-09-15:808

                                    2019-09-19

                                      -
                                    • For some reason my podman PostgreSQL container isn't working so I had to use Docker to re-create it for my testing work today:
                                    • +
                                    • For some reason my podman PostgreSQL container isn’t working so I had to use Docker to re-create it for my testing work today:
                                    # docker pull docker.io/library/postgres:9.6-alpine
# docker volume create dspacedb_data
                                    @@ -357,14 +357,14 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
                                     
                                  • I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update
                                  • Update the PostgreSQL JDBC driver to version 42.2.8 in our Ansible infrastructure scripts
                                  • Run system updates on DSpace Test (linode19) and reboot it
                                  • -
                                  • Start looking at IITA's latest round of batch updates that Sisay had uploaded to DSpace Test earlier this month +
                                  • Start looking at IITA’s latest round of batch updates that Sisay had uploaded to DSpace Test earlier this month
                                      -
                                    • For posterity, IITA's original input file was 20196th.xls and Sisay uploaded it as “IITA_Sep_06” to DSpace Test
                                    • -
• Sisay said he did run the csv-metadata-quality script on the records, but I assume he didn't run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields
                                    • +
                                    • For posterity, IITA’s original input file was 20196th.xls and Sisay uploaded it as “IITA_Sep_06” to DSpace Test
                                    • +
• Sisay said he did run the csv-metadata-quality script on the records, but I assume he didn’t run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields
                                    • In addition, a few records were missing authorship type
                                    • I deleted two invalid AGROVOC terms because they were ambiguous
                                    • Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine: @@ -391,19 +391,19 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
                                    • I created and merged a pull request for the updates
                                        -
                                      • This is the first time we've updated this controlled vocabulary since 2018-09
                                      • +
                                      • This is the first time we’ve updated this controlled vocabulary since 2018-09

                                    2019-09-20

                                      -
                                    • Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
                                    • +
                                    • Deploy a fresh snapshot of CGSpace’s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
• Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
                                      • They want to do some enrichment of the metadata to add countries and regions
                                      • Also, they noticed that some items have a blank ISSN in the citation like “ISSN:”
                                      • -
                                      • I told them it's probably best if we have Francesco produce a new export from Typo 3
                                      • -
                                      • But on second thought I think that I've already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs
                                      • +
                                      • I told them it’s probably best if we have Francesco produce a new export from Typo 3
                                      • +
                                      • But on second thought I think that I’ve already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs
                                      • Other corrections would be to replace “Inst.” and “Instit.” with “Institute” and remove those blank ISSNs from the citations
                                      • I will rename the files with multiple underscores so they match the filename column in the CSV using this command:
                                      @@ -415,14 +415,14 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
                                      • There are a few dozen that have completely fucked up names due to some encoding error
                                      • To make matters worse, when I tried to download them, some of the links in the “URL” column that Francesco included are wrong, so I had to go to the permalink and get a link that worked
                                      • -
                                      • After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:
                                      • +
                                      • After downloading everything I had to use Ubuntu’s version of rename to get rid of all the double and triple underscores:
                                    $ rename -v 's/___/_/g'  *.pdf
                                     $ rename -v 's/__/_/g'  *.pdf
                                     
                                      -
                                    • I'm still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
                                    • +
                                    • I’m still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)
                                    • I wrote two fairly long GREL expressions to clean up the institutional author names in the dc.contributor.author and dc.identifier.citation fields using OpenRefine
                                      • The first targets acronyms in parentheses like “International Livestock Research Institute (ILRI)":
                                      • @@ -469,14 +469,14 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
                                      • Play with language identification using the langdetect, fasttext, polyglot, and langid libraries
• polyglot requires too many system things to compile
                                        • -
                                        • langdetect didn't seem as accurate as the others
                                        • +
                                        • langdetect didn’t seem as accurate as the others
• fasttext is likely the best, but prints a blank line to the console when loading a model
                                        • langid seems to be the best considering the above experiences
                                      • I added very experimental language detection to the csv-metadata-quality module
                                          -
                                        • It works by checking the predicted language of the dc.title field against the item's dc.language.iso field
                                        • +
                                        • It works by checking the predicted language of the dc.title field against the item’s dc.language.iso field
                                        • I tested it on the Bioversity migration data set and it actually helped me correct eleven language fields in their records!
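For reference, the langid API behind that check is small enough to try from the command line (a sketch; the sample title is made up):

```
$ python3 -c "import langid; print(langid.classify('Estrategia de comunicación para el manejo de cultivos'))"
```

It prints a (language, score) tuple, which is what gets compared against the item's dc.language.iso value.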
                                      • @@ -504,7 +504,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
                                      • I deleted another item that I had previously identified as a duplicate that she had fixed by incorrectly deleting the original (ugh)
                                    • -
                                    • Get a list of institutions from CCAFS's Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
                                    • +
                                    • Get a list of institutions from CCAFS’s Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
                                    $ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
                                     $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
                                    @@ -516,8 +516,8 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
                                     
                                    • Skype with Peter and Abenet about CGSpace actions
                                        -
• Peter will respond to ICARDA's request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc
                                      • -
• We discussed using ISO 3166 for countries, though Peter doesn't like the formal names like “Moldova, Republic of” and “Tanzania, United Republic of”
• +
• Peter will respond to ICARDA’s request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc
                                      • +
                                      • We discussed using ISO 3166 for countries, though Peter doesn’t like the formal names like “Moldova, Republic of” and “Tanzania, United Republic of”
                                        • The Debian iso-codes package has ISO 3166-1 with “common name”, “name”, and “official name” representations, for example:
                                            @@ -528,14 +528,14 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
                                          • There are still some unfortunate ones there, though:
                                              -
                                            • name: Korea, Democratic People's Republic of
                                            • -
                                            • official_name: Democratic People's Republic of Korea
                                            • +
                                            • name: Korea, Democratic People’s Republic of
                                            • +
                                            • official_name: Democratic People’s Republic of Korea
                                          • -
• And this, which isn't even in English…
• +
                                          • And this, which isn’t even in English…
                                              -
                                            • name: Côte d'Ivoire
                                            • -
                                            • official_name: Republic of Côte d'Ivoire
                                            • +
                                            • name: Côte d’Ivoire
                                            • +
                                            • official_name: Republic of Côte d’Ivoire
                                          • The other alternative is to just keep using the names we have, which are mostly compliant with AGROVOC
diff --git a/docs/2019-10/index.html b/docs/2019-10/index.html
index 4a12b3521..3d68756da 100644
--- a/docs/2019-10/index.html
+++ b/docs/2019-10/index.html
@@ -6,7 +6,7 @@
-
+
@@ -14,8 +14,8 @@
-
-
+
+
@@ -45,7 +45,7 @@
-
+
@@ -92,7 +92,7 @@

                                            October, 2019

                                            @@ -102,15 +102,15 @@
                                          • Udana from IWMI asked me for a CSV export of their community on CGSpace
                                            • I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data
                                            • -
                                            • I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix:
                                            • +
                                            • I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix:
                                          $ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
                                           
                                            -
                                          • Then I replace them in vim with :% s/\%u00a0/ /g because I can't figure out the correct sed syntax to do it directly from the pipe above
                                          • +
                                          • Then I replace them in vim with :% s/\%u00a0/ /g because I can’t figure out the correct sed syntax to do it directly from the pipe above
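For the record, the same replacement can be done directly in the pipe with GNU sed by matching the UTF-8 bytes of U+00A0 (a sketch built on the csvcut command above):

```
$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv | sed 's/\xc2\xa0/ /g' > /tmp/iwmi-title-region-subregion-river.csv
```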
                                          • I uploaded those to CGSpace and then re-exported the metadata
                                          • -
                                          • Now that I think about it, I shouldn't be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!
                                          • +
                                          • Now that I think about it, I shouldn’t be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!
                                          • I modified the script so it replaces the non-breaking spaces instead of removing them
                                          • Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):
                                          @@ -125,7 +125,7 @@

                                        2019-10-04

                                          -
                                        • Create an account for Bioversity's ICT consultant Francesco on DSpace Test:
                                        • +
                                        • Create an account for Bioversity’s ICT consultant Francesco on DSpace Test:
                                        $ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
                                         
                                          @@ -162,7 +162,7 @@
                                        • Start looking at duplicates in the Bioversity migration data on DSpace Test
                                            -
                                          • I'm keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity
                                          • +
                                          • I’m keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity
                                        @@ -181,7 +181,7 @@
                                        • Felix Shaw from Earlham emailed me to ask about his admin account on DSpace Test
                                            -
                                          • His old one got lost when I re-sync'd DSpace Test with CGSpace a few weeks ago
                                          • +
                                          • His old one got lost when I re-sync’d DSpace Test with CGSpace a few weeks ago
                                          • I added a new account for him and added it to the Administrators group:
                                        • @@ -206,7 +206,7 @@ UPDATE 1
                                        • More work on identifying duplicates in the Bioversity migration data on DSpace Test
                                          • I mapped twenty-five more items on CGSpace and deleted them from the migration test collection on DSpace Test
                                          • -
                                          • After a few hours I think I finished all the duplicates that were identified by Atmire's Duplicate Checker module
                                          • +
                                          • After a few hours I think I finished all the duplicates that were identified by Atmire’s Duplicate Checker module
                                          • According to my spreadsheet there were fifty-two in total
                                        • @@ -234,8 +234,8 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
                                        • I would still like to perhaps (re)move institutional authors from dc.contributor.author to cg.contributor.affiliation, but I will have to run that by Francesca, Carol, and Abenet
                                        • I could use a custom text facet like this in OpenRefine to find authors that likely match the “Last, F.” pattern: isNotNull(value.match(/^.*, \p{Lu}\.?.*$/))
                                        • The \p{Lu} is a cool regex character class to make sure this works for letters with accents
                                        • -
                                        • As cool as that is, it's actually more effective to just search for authors that have “.” in them!
                                        • -
                                        • I've decided to add a cg.contributor.affiliation column to 1,025 items based on the logic above where the author name is not an actual person
                                        • +
                                        • As cool as that is, it’s actually more effective to just search for authors that have “.” in them!
                                        • +
                                        • I’ve decided to add a cg.contributor.affiliation column to 1,025 items based on the logic above where the author name is not an actual person
                                      @@ -279,7 +279,7 @@ real 82m35.993s 10568/129 (1 row)
                                      -
                                    • So I'm still not sure where these weird authors in the “Top Author” stats are coming from
                                    • +
                                    • So I’m still not sure where these weird authors in the “Top Author” stats are coming from

                                    2019-10-14

                                      @@ -302,12 +302,12 @@ $ mkdir 2019-10-15-Bioversity $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity $ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
                                        -
                                      • It's really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items
                                      • +
                                      • It’s really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items
                                      • Then I imported a test subset of them in my local test environment:
                                      $ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
                                       
                                        -
                                      • I had forgotten (again) that the dspace export command doesn't preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
                                      • +
                                      • I had forgotten (again) that the dspace export command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
                                      • On CGSpace I will increase the RAM of the command line Java process for good luck before import…
                                      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
                                      @@ -338,8 +338,8 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
                                       
                                    • Move the CGSpace CG Core v2 notes from a GitHub Gist to a page on this site for archive and searchability sake
                                    • Work on the CG Core v2 implementation testing
                                        -
                                      • I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn't figure out why
                                      • -
                                      • It seems to be because the dc.title→dcterms.title modifications cause the title metadata to disappear from DRI's <pageMeta> and therefore the title is not accessible to the XSL transformation
                                      • +
                                      • I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn’t figure out why
                                      • +
                                      • It seems to be because the dc.title→dcterms.title modifications cause the title metadata to disappear from DRI’s <pageMeta> and therefore the title is not accessible to the XSL transformation
                                      • Also, I noticed a few places in the Java code where dc.title is hard coded so I think this might be one of the fields that we just assume DSpace relies on internally
                                      • I will revert all changes to dc.title and dc.title.alternative
                                      • TODO: there are similar issues with the citation_author metadata element missing from DRI, so I might have to revert those changes too
                                      • diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html index f25c0ab76..bcf8ff89d 100644 --- a/docs/2019-11/index.html +++ b/docs/2019-11/index.html @@ -20,7 +20,7 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli 1277694 So 4.6 million from XMLUI and another 1.2 million from API requests -Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): +Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 @@ -48,14 +48,14 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli 1277694 So 4.6 million from XMLUI and another 1.2 million from API requests -Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): +Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 "/> - + @@ -85,7 +85,7 @@ Let's see how many of the REST API requests were for bitstreams (because the - + @@ -132,7 +132,7 @@ Let's see how many of the REST API requests were for bitstreams (because the

                                        November, 2019

                                        @@ -151,7 +151,7 @@ Let's see how many of the REST API requests were for bitstreams (because the 1277694
                                      • So 4.6 million from XMLUI and another 1.2 million from API requests
                                      • -
                                      • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                      • +
                                      • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                      # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                       1183456 
                                      @@ -173,7 +173,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
                                       
                                      # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
                                       365288
                                       
                                        -
                                      • Their user agent is one I've never seen before:
                                      • +
                                      • Their user agent is one I’ve never seen before:
                                      Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
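                                      To gauge whether those two Amazon IPs actually registered hits in the Solr statistics core, a per-IP query along these lines should work (a sketch; it borrows the localhost:8081 core and the xmllint pattern used elsewhere in these notes, filtered to October when the traffic occurred):
                                      $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=ip:34.224.4.16&fq=dateYearMonth:2019-10&rows=0' | xmllint --format - | grep numFound
                                      $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=ip:34.234.204.152&fq=dateYearMonth:2019-10&rows=0' | xmllint --format - | grep numFound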
                                       
                                        @@ -196,7 +196,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
                                      $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
                                       
                                        -
                                      • On the topic of spiders, I have been wanting to update DSpace's default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire's COUNTER-Robots project +
                                      • On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire’s COUNTER-Robots project
                                        • First I checked for a user agent that is in COUNTER-Robots, but NOT in the current dspace/config/spiders/example list
                                        • Then I made some item and bitstream requests on DSpace Test using that user agent:
                                        • @@ -215,25 +215,25 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result> </response>
                                        -
                                      • Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
                                      • +
                                      • Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:
                                      $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
                                       $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
                                       $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
                                       
                                        -
                                      • After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list… +
                                      • After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
                                          -
                                        • What's strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn't exist…
                                        • +
                                        • What’s strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn’t exist…
                                      spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
                                       
                                        -
                                      • Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file…
                                      • +
                                      • Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…
                                      • I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr
                                          -
                                        • Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace's “mark spiders” feature to change them to “isBot:true” in Solr
                                        • -
                                        • I restarted Tomcat and ran dspace stats-util -m and it did some stuff for awhile, but I still don't see any items in Solr with isBot:true
                                        • +
                                        • Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr
                                        • +
                                        • I restarted Tomcat and ran dspace stats-util -m and it did some stuff for a while, but I still don’t see any items in Solr with isBot:true
                                        • According to dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java the patterns for user agents are loaded from any file in the config/spiders/agents directory
                                        • I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran dspace stats-util -m and still there were no new items marked as being bots in Solr, so I think there is still something wrong
                                        • Jesus, the code in ./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java says that stats-util -m marks spider requests by their IPs, not by their user agents… WTF:
                                        • @@ -267,17 +267,17 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/>
                                            -
                                          • So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list +
                                          • So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
                                            • Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
                                          • -
                                          • I'm curious how the special character matching is in Solr, so I will test two requests: one with “www.gnip.com" which is in the spider list, and one with “www.gnyp.com" which isn't:
                                          • +
                                          • I’m curious how the special character matching is in Solr, so I will test two requests: one with “www.gnip.com” which is in the spider list, and one with “www.gnyp.com” which isn’t:
                                          $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
                                           $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
                                           
                                            -
                                          • Then commit changes to Solr so we don't have to wait:
                                          • +
                                          • Then commit changes to Solr so we don’t have to wait:
                                          $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
                                           $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
                                          @@ -352,7 +352,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
                                                 </lst>
                                               </lst>
                                           
                                            -
                                          • That answers Peter's question about why the stats jumped in October…
                                          • +
                                          • That answers Peter’s question about why the stats jumped in October…

                                          2019-11-08

                                            @@ -409,12 +409,12 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do

                                          2019-11-13

                                            -
                                          • The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn't linked with the DOI's score +
                                          • The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn’t linked with the DOI’s score
                                            • I tweeted it again with the Handle and the DOI
                                          • -
                                          • Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr's regex search can't use those
                                          • +
                                          • Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr’s regex search can’t use those
                                          $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
                                           $ http "http://localhost:8081/solr/statistics/update?commit=true"
                                          @@ -424,19 +424,19 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
                                             <result name="response" numFound="1" start="0">
                                           
                                          • Nice, so searching with regex in Solr with // syntax works for those digits!
                                          • -
                                          • I realized that it's easier to search Solr from curl via POST using this syntax:
                                          • +
                                          • I realized that it’s easier to search Solr from curl via POST using this syntax:
                                          $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
                                           
                                          • If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
                                              -
                                            • You can disable this using the -g option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr's regex search:
                                            • +
                                            • You can disable this using the -g option, but there are other benefits to searching with POST; for example, it seems that I have fewer issues with escaping special parameters when using Solr’s regex search (see the sketch after this example):
                                          $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
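                                          A related convenience: curl’s --data-urlencode handles the encoding of these special characters itself, which avoids both the URL globbing problem and manual escaping (a sketch reusing the Scrapoo pattern from above). It may also be relevant to the plus-sign trouble described under 2019-11-15, since a literal + in an un-encoded POST body is decoded as a space by the server:
                                          $ curl -s 'http://localhost:8081/solr/statistics/select' --data-urlencode 'q=userAgent:/Scrapoo\/[0-9]/' --data-urlencode 'rows=0'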
                                           
                                            -
                                          • I updated the check-spider-hits.sh script to use the POST syntax, and I'm evaluating the feasability of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
                                          • +
                                          • I updated the check-spider-hits.sh script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling (a purge sketch follows below)
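                                          For reference, the purge step ultimately boils down to a Solr delete-by-query against the same update handler; a rough sketch (not taken from check-spider-hits.sh itself), again using the Scrapoo pattern:
                                          $ curl -s 'http://localhost:8081/solr/statistics/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>userAgent:/Scrapoo\/[0-9]/</query></delete>'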

                                          2019-11-14

                                            @@ -456,14 +456,14 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
                                          • Greatly improve my check-spider-hits.sh script to handle regular expressions in the spider agents patterns file
                                            • This allows me to detect and purge many more hits from the Solr statistics core
                                            • -
                                            • I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores
                                            • +
                                            • I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores

                                          2019-11-15

                                            -
                                          • Run the new version of check-spider-hits.sh on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong
                                          • -
                                          • But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like: +
                                          • Run the new version of check-spider-hits.sh on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong
                                          • +
                                          • But then I noticed that some (all?) of the hits weren’t actually getting purged, all of which were using regular expressions like:
                                            • MetaURI[\+\s]API\/[0-9]\.[0-9]
                                            • FDM(\s|\+)[0-9]
                                            • @@ -474,10 +474,10 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
                                          • Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
                                          • -
                                          • Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is +
                                          • Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
                                            • I tried to do URL encoding of the +, double escaping, etc… but nothing worked
                                            • -
                                            • I'm going to ignore regular expressions that have pluses for now
                                            • +
                                            • I’m going to ignore regular expressions that have pluses for now
                                          • I think I might also have to ignore patterns that have percent signs, like ^\%?default\%?$
                                          • @@ -495,7 +495,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
                                          • statistics: 1043373
                                          -
                                        • That's 1.4 million hits in addition to the 2 million I purged earlier this week…
                                        • +
                                        • That’s 1.4 million hits in addition to the 2 million I purged earlier this week…
                                        • For posterity, the major contributors to the hits on the statistics core were:
                                          • Purging 812429 hits from curl/ in statistics
                                          • @@ -512,7 +512,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i

                                          2019-11-17

                                            -
                                          • Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE's collection) was added recently and might have not been in the last harvesting yet +
                                          • Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might not have been in the last harvesting yet +
                                            • I told her no, that the department is several years old, and the item was added in 2017
                                            • Then I looked again at the dashboard for each department and I see the item in both departments now… shit.
                                            • @@ -538,7 +538,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i

                                            2019-11-19

                                              -
                                            • Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something +
                                            • Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
                                              • I had previously sent them an export in 2019-04
                                              @@ -555,15 +555,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
                                            • Found 4429 hits from ^User-Agent in statistics-2016
                                          • -
                                          • Buck is one I've never heard of before, its user agent is:
                                          • +
                                          • Buck is one I’ve never heard of before; its user agent is:
                                          Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
                                           
                                            -
                                          • All in all that's about 85,000 more hits purged, in addition to the 3.4 million I purged last week
                                          • +
                                          • All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week

                                          2019-11-20

                                            -
                                          • Email Usman Muchlish from CIFOR to see what he's doing with their DSpace lately
                                          • +
                                          • Email Usman Muchlish from CIFOR to see what he’s doing with their DSpace lately

                                          2019-11-21

                                            @@ -599,8 +599,8 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
                                            • I rebooted DSpace Test (linode19) and it kernel panicked at boot
                                                -
                                              • I looked on the console and saw that it can't mount the root filesystem
                                              • -
                                              • I switched the boot configuration to use the OS's kernel via GRUB2 instead of Linode's kernel and then it came up after reboot…
                                              • +
                                              • I looked on the console and saw that it can’t mount the root filesystem
                                              • +
                                              • I switched the boot configuration to use the OS’s kernel via GRUB2 instead of Linode’s kernel and then it came up after reboot…
                                              • I initiated a migration of the server from the Fremont, CA region to Frankfurt, DE
                                                • The migration is going very slowly, so I assume the network issues from earlier this year are still not fixed
                                                • diff --git a/docs/2019-12/index.html b/docs/2019-12/index.html index 5adabde30..5b2934fd0 100644 --- a/docs/2019-12/index.html +++ b/docs/2019-12/index.html @@ -43,7 +43,7 @@ Make sure all packages are up to date and the package manager is up to date, the # dpkg -C # reboot "/> - + @@ -73,7 +73,7 @@ Make sure all packages are up to date and the package manager is up to date, the - + @@ -120,7 +120,7 @@ Make sure all packages are up to date and the package manager is up to date, the

                                                  December, 2019

                                                  @@ -159,13 +159,13 @@ Make sure all packages are up to date and the package manager is up to date, the # apt install 'nginx=1.16.1-1~bionic' # reboot
                                                    -
                                                  • After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it's working:
                                                  • +
                                                  • After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it’s working:
                                                  # rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
                                                   # rm -rf /opt/ilri/dspace-statistics-api/venv
                                                   # /opt/certbot-auto
                                                   
                                                    -
                                                  • Clear Ansible's fact cache and re-run the playbooks to update the system's firewalls, SSH config, etc
                                                  • +
                                                  • Clear Ansible’s fact cache and re-run the playbooks to update the system’s firewalls, SSH config, etc
                                                  • Altmetric finally responded to my question about Dublin Core fields
                                                    • They shared a list of fields they use for tracking, but it only mentions HTML meta tags, and not fields considered when harvesting via OAI
                                                    • @@ -191,8 +191,8 @@ Make sure all packages are up to date and the package manager is up to date, the
                                                      $ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
                                                       $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/dspacetest-104030.xml
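                                                       To see exactly what differs between the two OAI responses, one option is to pretty-print both with xmllint and diff them (a quick sketch):
                                                       $ diff <(xmllint --format /tmp/cgspace-104030.xml) <(xmllint --format /tmp/dspacetest-104030.xml)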
                                                       
                                                        -
                                                      • The DSpace Test ones actually now capture the DOI, where the CGSpace doesn't…
                                                      • -
                                                      • And the DSpace Test one doesn't include review status as dc.description, but I don't think that's an important field
                                                      • +
                                                      • The DSpace Test ones actually now capture the DOI, whereas the CGSpace ones don’t…
                                                      • +
                                                      • And the DSpace Test one doesn’t include review status as dc.description, but I don’t think that’s an important field

                                                      2019-12-04

                                                        @@ -219,7 +219,7 @@ COPY 48
                                                        • Enrico noticed that the AReS Explorer on CGSpace (linode18) was down
                                                            -
                                                          • I only see HTTP 502 in the nginx logs on CGSpace… so I assume it's something wrong with the AReS server
                                                          • +
                                                          • I only see HTTP 502 in the nginx logs on CGSpace… so I assume it’s something wrong with the AReS server
                                                          • I ran all system updates on the AReS server (linode20) and rebooted it
                                                          • After rebooting the Explorer was accessible again
                                                          @@ -242,11 +242,11 @@ COPY 48
                                                          • Post message to Yammer about good practices for thumbnails on CGSpace
                                                              -
                                                            • On the topic of thumbnails, I'm thinking we might want to force regenerate all PDF thumbnails on CGSpace since we upgraded it to Ubuntu 18.04 and got a new ghostscript…
                                                            • +
                                                            • On the topic of thumbnails, I’m thinking we might want to force regenerate all PDF thumbnails on CGSpace since we upgraded it to Ubuntu 18.04 and got a new ghostscript…
                                                          • More discussion about report formats for AReS
                                                          • -
                                                          • Peter noticed that the Atmire reports weren't showing any statistics before 2019 +
                                                          • Peter noticed that the Atmire reports weren’t showing any statistics before 2019
                                                            • I checked and indeed Solr had an issue loading some core last time it was started
                                                            • I restarted Tomcat three times before all cores came up successfully
                                                            • @@ -278,7 +278,7 @@ COPY 48
                                                            • I created an issue for “extended” text reports on the AReS GitHub (#9)
                                                          • -
                                                          • I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn't support images
                                                          • +
                                                          • I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn’t support images
                                                          • Export a list of all investors (dc.description.sponsorship) for Peter to look through and correct:
                                                          dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
                                                          @@ -310,7 +310,7 @@ UPDATE 2
                                                           
                                                        • Add three new CCAFS Phase II project tags to CGSpace (#441)
                                                        • Linode said DSpace Test (linode19) had an outbound traffic rate of 73Mb/sec for the last two hours
                                                            -
                                                          • I see some Russian bot active in nginx's access logs:
                                                          • +
                                                          • I see some Russian bot active in nginx’s access logs:
                                                        @@ -349,7 +349,7 @@ UPDATE 1
                                                      @@ -357,7 +357,7 @@ UPDATE 1
                                                      • Follow up with Altmetric on the issue where an item has a different (lower) score for its Handle despite it having a correct DOI (with a higher score)
                                                          -
                                                        • I've raised this issue three times to Altmetric this year, and a few weeks ago they said they would re-process the item “before Christmas”
                                                        • +
                                                        • I’ve raised this issue three times to Altmetric this year, and a few weeks ago they said they would re-process the item “before Christmas”
                                                      • Abenet suggested we use cg.reviewStatus instead of cg.review-status and I agree that we should follow other examples like DCTERMS.accessRights and DCTERMS.isPartOf @@ -370,7 +370,7 @@ UPDATE 1
                                                        • Altmetric responded a few days ago about the item that has a different (lower) score for its Handle despite it having a correct DOI (with a higher score)
                                                            -
                                                          • She tweeted the repository link and agreed that it didn't get picked up by Altmetric
                                                          • +
                                                          • She tweeted the repository link and agreed that it didn’t get picked up by Altmetric
                                                          • She said she will add this to the existing ticket about the previous items I had raised an issue about
                                                        • diff --git a/docs/2020-01/index.html b/docs/2020-01/index.html index 6c1a3e0fb..6087a5de4 100644 --- a/docs/2020-01/index.html +++ b/docs/2020-01/index.html @@ -53,7 +53,7 @@ I tweeted the CGSpace repository link "/> - + @@ -63,7 +63,7 @@ I tweeted the CGSpace repository link "@type": "BlogPosting", "headline": "January, 2020", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-01\/", - "wordCount": "2117", + "wordCount": "2754", "datePublished": "2020-01-06T10:48:30+02:00", "dateModified": "2020-01-23T15:56:46+02:00", "author": { @@ -83,7 +83,7 @@ I tweeted the CGSpace repository link - + @@ -130,7 +130,7 @@ I tweeted the CGSpace repository link

                                                          January, 2020

                                                          @@ -185,7 +185,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
                                                        <e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401
                                                         
                                                          -
                                                        • If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…
                                                        • +
                                                        • If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it’s stored incorrectly in the database…
                                                        • Other encodings like windows-1251 and windows-1257 also fail on different characters like “ž” and “é” that are legitimate UTF-8 characters
                                                        • Then there is the issue of Russian, Chinese, etc characters, which are simply not representable in any of those encodings
                                                        • I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
                                                        • @@ -206,8 +206,8 @@ java.net.SocketTimeoutException: Read timed out
                                                        • I am not sure how I will fix that shard…
                                                        • I discovered a very interesting tool called ftfy that attempts to fix errors in UTF-8
                                                            -
                                                          • I'm curious to start checking input files with this to see what it highlights
                                                          • -
                                                          • I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
                                                          • +
                                                          • I’m curious to start checking input files with this to see what it highlights
                                                          • +
                                                          • I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don’t know what it’s called?) to digraphs (é→é), which vim identifies as:
                                                          • <e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
                                                          • <é> 233, Hex 00e9, Oct 351, Digr e'
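                                                          What ftfy appears to have done with those accents is Unicode NFC normalization (composing e + U+0301 into a single é); that step on its own can be applied with Python’s unicodedata, for example (a sketch, using the authors file mentioned earlier):
                                                          $ python3 -c "import sys, unicodedata; sys.stdout.write(unicodedata.normalize('NFC', sys.stdin.read()))" < /tmp/2020-01-08-authors.csv > /tmp/2020-01-08-authors-nfc.csv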
                                                          @@ -283,10 +283,10 @@ COPY 35
                                                        • I opened a new pull request on the cg-core repository to validate and fix the formatting of the HTML files
                                                        • Create more issues for OpenRXV:
                                                        @@ -352,7 +352,7 @@ $ wc -l hung-nguyen-a*handles.txt 56 hung-nguyen-atmire-handles.txt 102 total
                                                          -
                                                        • Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet +
                                                        • Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven’t been indexed yet
                                                          • I am curious to check tomorrow to see if they are there
                                                          @@ -383,7 +383,7 @@ $ wc -l hung-nguyen-a*handles.txt
                                                        $ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
                                                         
                                                          -
                                                        • Here I'm also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using -flatten like DSpace already does
                                                        • +
                                                        • Here I’m also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using -flatten like DSpace already does
                                                        • I did some tests with a modified version of the above that uses -flatten and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):
                                                        $ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
                                                        @@ -391,16 +391,58 @@ $ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
                                                         $ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
                                                         $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
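                                                         To compare the two 600px variants on dimensions and file size, ImageMagick’s identify is handy (a sketch using the filenames from above; %wx%h is the geometry and %b the file size):
                                                         $ identify -format '%f: %wx%h %b\n' 10568-97925-d288-lagrange-thumbnail.pdf.jpg 10568-97925-thumbnail.pdf.jpg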
                                                         
                                                          -
                                                        • This emulate's DSpace's method of generating a high-quality image from the PDF and then creating a thumbnail
                                                        • -
                                                        • I put together a proof of concept of this by adding the extra options to dspace-api's ImageMagickThumbnailFilter.java and it works
                                                        • +
                                                        • This emulates DSpace’s method of generating a high-quality image from the PDF and then creating a thumbnail
                                                        • +
                                                        • I put together a proof of concept of this by adding the extra options to dspace-api’s ImageMagickThumbnailFilter.java and it works
                                                        • I need to run tests on a handful of PDFs to see if there are any side effects
                                                        • -
                                                        • The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org's 400KiB PNG!
                                                        • +
                                                        • The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org’s 400KiB PNG!
                                                        • Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:
                                                        $ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
                                                         $ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
                                                         $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
                                                        -
                                                        +

                                                        2020-01-26

                                                        +
                                                          +
                                                        • Add “Gender” to controlled vocabulary for CRPs (#442)
                                                        • +
                                                        • Deploy the changes on CGSpace and run all updates on the server and reboot it +
                                                            +
                                                          • I had to restart the tomcat7 service several times until all Solr statistics cores came up OK
                                                          • +
                                                          +
                                                        • +
                                                        • I spent a few hours writing a script (create-thumbnails) to compare the default DSpace thumbnails with the improved parameters above and actually when comparing them at size 600px I don’t really notice much difference, other than the new ones have slightly crisper text +
                                                            +
                                                          • So that was a waste of time, though I think our 300px thumbnails are a bit small now
                                                          • +
                                                          • Another thread on the ImageMagick forum mentions that you need to set the density, then read the image, then set the density again:
                                                          • +
                                                          +
                                                        • +
                                                        +
                                                        $ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
                                                        +
                                                          +
                                                        • One thing worth mentioning was this syntax for extracting bits from JSON in bash using jq:
                                                        • +
                                                        +
                                                        $ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
                                                        +$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink'
                                                        +"/bitstreams/172559/retrieve"
                                                        +
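                                                        Using that retrieveLink, downloading the bitstream itself should just be a matter of prepending the REST base URL (a sketch, assuming a single ORIGINAL bitstream; the output filename is arbitrary):
                                                        $ LINK=$(echo $RESPONSE | jq -r '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink')
                                                        $ curl -s -o /tmp/bitstream.pdf "https://dspacetest.cgiar.org/rest$LINK"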

                                                        2020-01-27

                                                        +
                                                          +
                                                        • Bizu has been having problems when she logs into CGSpace: she can’t see the community list on the front page +
                                                            +
                                                          • This last happened for another user in 2016-11, and it was related to the Tomcat maxHttpHeaderSize being too small because the user was in too many groups
                                                          • +
                                                          • I see that it is similar, with this message appearing in the DSpace log just after she logs in:
                                                          • +
                                                          +
                                                        • +
                                                        +
                                                        2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
                                                        +org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
                                                        +
                                                          +
                                                        • Now this appears to be a Solr limit of some kind (“too many boolean clauses”) +
                                                            +
                                                          • I changed the maxBooleanClauses for all Solr cores on DSpace Test from 1024 to 2048 (see the sketch below) and then she was able to see her communities…
                                                          • +
                                                          • I made a pull request and merged it to the 5_x-prod branch and will deploy on CGSpace later tonight
                                                          • +
                                                          • I am curious if anyone on the dspace-tech mailing list has run into this, so I will try to send a message about this there when I get a chance
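                                                          For reference, the limit lives in each Solr core’s solrconfig.xml; a sketch of finding and bumping it in a DSpace 5.x source tree (paths assumed, and the deployed cores need the same change plus a Tomcat restart before it takes effect):
                                                          $ grep maxBooleanClauses dspace/solr/*/conf/solrconfig.xml
                                                          $ sed -i '/maxBooleanClauses/s/1024/2048/' dspace/solr/*/conf/solrconfig.xml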
                                                          • +
                                                          +
                                                        • +
                                                        + diff --git a/docs/404.html b/docs/404.html deleted file mode 100644 index 1ed3a122a..000000000 --- a/docs/404.html +++ /dev/null @@ -1,144 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - CGSpace Notes - - - - - - - - - - - - - - - - - - - -
                                                        - - - - - - - - - diff --git a/docs/categories/index.html b/docs/categories/index.html index e3b0c631d..429c32306 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -95,7 +95,7 @@

                                                        January, 2020

                                                        @@ -132,7 +132,7 @@

                                                        December, 2019

                                                        @@ -164,7 +164,7 @@

                                                        November, 2019

                                                        @@ -183,7 +183,7 @@ 1277694
                                                        • So 4.6 million from XMLUI and another 1.2 million from API requests
                                                        • -
                                                        • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                        • +
                                                        • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                        # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                         1183456 
                                                        @@ -202,10 +202,10 @@
                                                           

                                                        CGSpace CG Core v2 Migration

                                                        @@ -223,12 +223,12 @@

                                                        October, 2019

                                                        - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. Read more → @@ -241,7 +241,7 @@

                                                        September, 2019

                                                        @@ -286,14 +286,14 @@

                                                        August, 2019

                                                        2019-08-03

                                                          -
                                                        • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
                                                        • +
                                                        • Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

                                                        2019-08-04

                                                          @@ -301,7 +301,7 @@
                                                        • Run system updates on CGSpace (linode18) and reboot it
                                                          • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
                                                          • -
                                                          • After rebooting, all statistics cores were loaded… wow, that's lucky.
                                                          • +
                                                          • After rebooting, all statistics cores were loaded… wow, that’s lucky.
                                                        • Run system updates on DSpace Test (linode19) and reboot it
                                                        • @@ -318,7 +318,7 @@

                                                          July, 2019

                                                          @@ -346,7 +346,7 @@

                                                          June, 2019

                                                          @@ -372,7 +372,7 @@

                                                          May, 2019

diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index cc54c2dd4..e86902a90 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html

                                                          January, 2020

                                                          @@ -117,7 +117,7 @@

                                                          December, 2019

                                                          @@ -149,7 +149,7 @@

                                                          November, 2019

                                                          @@ -168,7 +168,7 @@ 1277694
                                                        • So 4.6 million from XMLUI and another 1.2 million from API requests
                                                        • -
                                                        • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                        • +
                                                        • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
                                                        # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
                                                         1183456 
                                                        @@ -187,10 +187,10 @@
                                                           

                                                        CGSpace CG Core v2 Migration

                                                        @@ -208,12 +208,12 @@

                                                        October, 2019

                                                        - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. Read more → @@ -226,7 +226,7 @@

                                                        September, 2019

                                                        @@ -271,14 +271,14 @@

                                                        August, 2019

                                                        2019-08-03

                                                          -
                                                        • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
                                                        • +
                                                        • Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

                                                        2019-08-04

                                                          @@ -286,7 +286,7 @@
                                                        • Run system updates on CGSpace (linode18) and reboot it
                                                          • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
                                                          • -
                                                          • After rebooting, all statistics cores were loaded… wow, that's lucky.
                                                          • +
                                                          • After rebooting, all statistics cores were loaded… wow, that’s lucky.
                                                        • Run system updates on DSpace Test (linode19) and reboot it
                                                        • @@ -303,7 +303,7 @@

                                                          July, 2019

                                                          @@ -331,7 +331,7 @@

                                                          June, 2019

                                                          @@ -357,7 +357,7 @@

                                                          May, 2019

                                                          diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml index 2d5395f5e..c08d14222 100644 --- a/docs/categories/notes/index.xml +++ b/docs/categories/notes/index.xml @@ -82,7 +82,7 @@ 1277694 </code></pre><ul> <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> -<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> +<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> </ul> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; 1183456 @@ -107,7 +107,7 @@ Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. 
@@ -154,7 +154,7 @@ https://alanorth.github.io/cgspace-notes/2019-08/ <h2 id="2019-08-03">2019-08-03</h2> <ul> -<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> +<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> </ul> <h2 id="2019-08-04">2019-08-04</h2> <ul> @@ -162,7 +162,7 @@ <li>Run system updates on CGSpace (linode18) and reboot it <ul> <li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li> -<li>After rebooting, all statistics cores were loaded&hellip; wow, that's lucky.</li> +<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li> </ul> </li> <li>Run system updates on DSpace Test (linode19) and reboot it</li> @@ -269,9 +269,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace https://alanorth.github.io/cgspace-notes/2019-03/ <h2 id="2019-03-01">2019-03-01</h2> <ul> -<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> +<li>I checked IITA&rsquo;s 259 Feb 14 records from last month for duplicates using Atmire&rsquo;s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> <li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc&hellip;</li> -<li>Looking at the other half of Udana's WLE records from 2018-11 +<li>Looking at the other half of Udana&rsquo;s WLE records from 2018-11 <ul> <li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li> <li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li> @@ -329,7 +329,7 @@ sys 0m1.979s <h2 id="2019-01-02">2019-01-02</h2> <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> -<li>I don't see anything interesting in the web server logs around that time though:</li> +<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 @@ -390,7 +390,7 @@ sys 0m1.979s <h2 id="2018-10-01">2018-10-01</h2> <ul> <li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li> -<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li> +<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I&rsquo;m super busy in Nairobi right now</li> </ul> @@ -403,9 +403,9 @@ sys 0m1.979s <h2 id="2018-09-02">2018-09-02</h2> <ul> <li>New <a 
href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li> -<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> -<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li> -<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li> +<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> +<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li> +<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li> </ul> @@ -424,10 +424,10 @@ sys 0m1.979s [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB </code></pre><ul> <li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> -<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li> -<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError&hellip;</li> +<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</li> +<li>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</li> <li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> -<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li> +<li>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</li> <li>I ran all system updates on DSpace Test and rebooted it</li> </ul> @@ -460,7 +460,7 @@ sys 0m1.979s <ul> <li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>) <ul> -<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li> +<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> </ul> </li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> @@ -506,7 +506,7 @@ sys 2m7.289s https://alanorth.github.io/cgspace-notes/2018-04/ <h2 id="2018-04-01">2018-04-01</h2> <ul> -<li>I tried to test something on DSpace Test but noticed that 
it's down since god knows when</li> +<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li> <li>Catalina logs at least show some memory errors yesterday:</li> </ul> @@ -532,9 +532,9 @@ sys 2m7.289s <h2 id="2018-02-01">2018-02-01</h2> <ul> <li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li> -<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li> +<li>We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list</li> <li>Yesterday I figured out how to monitor DSpace sessions using JMX</li> -<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> +<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu&rsquo;s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> </ul> @@ -547,7 +547,7 @@ sys 2m7.289s <h2 id="2018-01-02">2018-01-02</h2> <ul> <li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li> -<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li> +<li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li> <li>And just before that I see this:</li> @@ -555,8 +555,8 @@ sys 2m7.289s <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. </code></pre><ul> <li>Ah hah! 
So the pool was actually empty!</li> -<li>I need to increase that, let's try to bump it up from 50 to 75</li> -<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li> +<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li> +<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li> <li>I notice this error quite a few times in dspace.log:</li> </ul> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets @@ -609,7 +609,7 @@ dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 </code></pre><ul> -<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li> +<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</li> </ul> @@ -664,7 +664,7 @@ COPY 54701 </ul> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 </code></pre><ul> -<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> +<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> <li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> </ul> diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 2d6dc71da..89c82dfbd 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -14,7 +14,7 @@ - + @@ -28,7 +28,7 @@ - + @@ -80,7 +80,7 @@

                                                          April, 2019

                                                          @@ -121,16 +121,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

                                                          March, 2019

                                                          2019-03-01

                                                            -
                                                          • I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
                                                          • +
                                                          • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
                                                          • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
                                                          • -
                                                          • Looking at the other half of Udana's WLE records from 2018-11 +
                                                          • Looking at the other half of Udana’s WLE records from 2018-11
                                                            • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
                                                            • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
                                                            • @@ -153,7 +153,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

                                                              February, 2019

                                                              @@ -198,7 +198,7 @@ sys 0m1.979s

                                                              January, 2019

                                                              @@ -206,7 +206,7 @@ sys 0m1.979s

                                                              2019-01-02

                                                              • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
                                                              • -
                                                              • I don't see anything interesting in the web server logs around that time though:
                                                              • +
                                                              • I don’t see anything interesting in the web server logs around that time though:
                                                              # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                    92 40.77.167.4
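To see what that top host was actually doing I could pull out the user agents it sent (a sketch, assuming our nginx logs use the standard combined format, where the user agent is the sixth double-quoted field):

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '40.77.167.4' | awk -F'"' '{print $6}' | sort | uniq -c | sort -n | tail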
                                                              @@ -232,7 +232,7 @@ sys     0m1.979s
                                                                 

                                                              December, 2018

                                                              @@ -259,7 +259,7 @@ sys 0m1.979s

                                                              November, 2018

                                                              @@ -286,7 +286,7 @@ sys 0m1.979s

                                                              October, 2018

                                                              @@ -294,7 +294,7 @@ sys 0m1.979s

                                                              2018-10-01

                                                              • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
                                                              • -
                                                              • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
                                                              • +
                                                              • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
                                                              Read more → @@ -308,7 +308,7 @@ sys 0m1.979s

                                                              September, 2018

                                                              @@ -316,9 +316,9 @@ sys 0m1.979s

                                                              2018-09-02

                                                              • New PostgreSQL JDBC driver version 42.2.5
                                                              • -
                                                              • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
                                                              • -
                                                              • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
                                                              • -
                                                              • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
                                                              • +
                                                              • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
                                                              • +
                                                              • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
                                                              • +
                                                              • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
                                                              Read more → @@ -332,7 +332,7 @@ sys 0m1.979s

                                                              August, 2018

                                                              @@ -346,10 +346,10 @@ sys 0m1.979s [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
                                                              • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
                                                              • -
                                                              • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
                                                              • -
                                                              • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
                                                              • +
                                                              • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
                                                              • +
                                                              • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
                                                              • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
                                                              • -
                                                              • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
                                                              • +
                                                              • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
                                                              • I ran all system updates on DSpace Test and rebooted it
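For reference, the heap bump mentioned above is just a matter of raising -Xms/-Xmx wherever Tomcat's JVM options are set, for example in /etc/default/tomcat7 (a sketch only; ours is templated by the Ansible infrastructure scripts):

JAVA_OPTS="$JAVA_OPTS -Xms6144m -Xmx6144m"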
                                                              Read more → @@ -364,7 +364,7 @@ sys 0m1.979s

                                                              July, 2018

diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 8c555efd9..15e790af3 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html

                                                              June, 2018

                                                              @@ -89,7 +89,7 @@
                                                              • Test the DSpace 5.8 module upgrades from Atmire (#378)
                                                                  -
                                                                • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
                                                                • +
                                                                • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
                                                              • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
                                                              • @@ -118,7 +118,7 @@ sys 2m7.289s

                                                                May, 2018

                                                                @@ -146,14 +146,14 @@ sys 2m7.289s

                                                                April, 2018

                                                                2018-04-01

                                                                  -
                                                                • I tried to test something on DSpace Test but noticed that it's down since god knows when
                                                                • +
                                                                • I tried to test something on DSpace Test but noticed that it’s down since god knows when
                                                                • Catalina logs at least show some memory errors yesterday:
                                                                Read more → @@ -168,7 +168,7 @@ sys 2m7.289s

                                                                March, 2018

                                                                @@ -189,7 +189,7 @@ sys 2m7.289s

                                                                February, 2018

                                                                @@ -197,9 +197,9 @@ sys 2m7.289s

                                                                2018-02-01

                                                                • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
                                                                • -
                                                                • We don't need to distinguish between internal and external works, so that makes it just a simple list
                                                                • +
                                                                • We don’t need to distinguish between internal and external works, so that makes it just a simple list
                                                                • Yesterday I figured out how to monitor DSpace sessions using JMX
                                                                • -
                                                                • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
                                                                • +
                                                                • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
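Exposing the JMX endpoint in the first place is just a few standard JVM flags on Tomcat (a sketch; the port is arbitrary, and skipping SSL and authentication is only sane when the port is firewalled to localhost):

CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"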
                                                                Read more → @@ -213,7 +213,7 @@ sys 2m7.289s

                                                                January, 2018

                                                                @@ -221,7 +221,7 @@ sys 2m7.289s

                                                                2018-01-02

                                                                • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
                                                                • -
                                                                • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
                                                                • +
                                                                • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
                                                                • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
                                                                • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
                                                                • And just before that I see this:
                                                                • @@ -229,8 +229,8 @@ sys 2m7.289s
                                                                  Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
                                                                   
                                                                  • Ah hah! So the pool was actually empty!
                                                                  • -
                                                                  • I need to increase that, let's try to bump it up from 50 to 75
                                                                  • -
                                                                  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
                                                                  • +
                                                                  • I need to increase that, let’s try to bump it up from 50 to 75
                                                                  • +
                                                                  • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
                                                                  • I notice this error quite a few times in dspace.log:
                                                                  2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
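As far as I can tell the pool bump from 50 to 75 is just a one-line change in dspace.cfg, assuming we are still using DSpace's built-in Tomcat JDBC pool rather than a JNDI data source:

db.maxconnections = 75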
                                                                  @@ -283,7 +283,7 @@ dspace.log.2017-12-31:53
                                                                   dspace.log.2018-01-01:45
                                                                   dspace.log.2018-01-02:34
                                                                   
                                                                    -
                                                                  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
                                                                  • +
                                                                  • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
                                                                  Read more → @@ -297,7 +297,7 @@ dspace.log.2018-01-02:34

                                                                  December, 2017

                                                                  @@ -321,7 +321,7 @@ dspace.log.2018-01-02:34

                                                                  November, 2017

                                                                  @@ -354,7 +354,7 @@ COPY 54701

                                                                  October, 2017

                                                                  @@ -365,7 +365,7 @@ COPY 54701
                                                                http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
                                                                 
                                                                  -
                                                                • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
                                                                • +
                                                                • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
                                                                • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
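If the pattern really is just two handles joined by a double pipe, one possible cleanup is to keep only the first value directly in PostgreSQL with split_part() (a sketch only; in practice I would limit it to the relevant metadata_field_id and test it in a transaction on DSpace Test first):

dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = split_part(text_value, '||', 1) WHERE text_value LIKE '%||http://hdl.handle.net/%';
dspace=# COMMIT;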
                                                                Read more → @@ -380,10 +380,10 @@ COPY 54701

                                                                CGIAR Library Migration

diff --git a/docs/categories/page/2/index.html b/docs/categories/page/2/index.html
index 9df738b2d..0b37b00a7 100644
--- a/docs/categories/page/2/index.html
+++ b/docs/categories/page/2/index.html

                                                                April, 2019

                                                                @@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

                                                                March, 2019

                                                                2019-03-01

                                                                  -
                                                                • I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
                                                                • +
                                                                • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
                                                                • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
                                                                • -
                                                                • Looking at the other half of Udana's WLE records from 2018-11 +
                                                                • Looking at the other half of Udana’s WLE records from 2018-11
                                                                  • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
                                                                  • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
                                                                  • @@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

                                                                    February, 2019

                                                                    @@ -213,7 +213,7 @@ sys 0m1.979s

                                                                    January, 2019

                                                                    @@ -221,7 +221,7 @@ sys 0m1.979s

                                                                    2019-01-02

                                                                    • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
                                                                    • -
                                                                    • I don't see anything interesting in the web server logs around that time though:
                                                                    • +
                                                                    • I don’t see anything interesting in the web server logs around that time though:
                                                                    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                                                                          92 40.77.167.4
                                                                    @@ -247,7 +247,7 @@ sys     0m1.979s
                                                                       

                                                                    December, 2018

                                                                    @@ -274,7 +274,7 @@ sys 0m1.979s

                                                                    November, 2018

                                                                    @@ -301,7 +301,7 @@ sys 0m1.979s

                                                                    October, 2018

                                                                    @@ -309,7 +309,7 @@ sys 0m1.979s

                                                                    2018-10-01

                                                                    • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
                                                                    • -
                                                                    • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
                                                                    • +
                                                                    • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
                                                                    Read more → @@ -323,7 +323,7 @@ sys 0m1.979s

                                                                    September, 2018

                                                                    @@ -331,9 +331,9 @@ sys 0m1.979s

                                                                    2018-09-02

                                                                    • New PostgreSQL JDBC driver version 42.2.5
                                                                    • -
                                                                    • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
                                                                    • -
                                                                    • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
                                                                    • -
                                                                    • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
                                                                    • +
                                                                    • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
                                                                    • +
                                                                    • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
                                                                    • +
                                                                    • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
                                                                    Read more → @@ -347,7 +347,7 @@ sys 0m1.979s

                                                                    August, 2018

                                                                    @@ -361,10 +361,10 @@ sys 0m1.979s [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
                                                                    • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
                                                                    • -
                                                                    • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
                                                                    • -
                                                                    • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
                                                                    • +
                                                                    • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
                                                                    • +
                                                                    • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
                                                                    • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
                                                                    • -
                                                                    • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
                                                                    • +
                                                                    • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
                                                                    • I ran all system updates on DSpace Test and rebooted it
                                                                    Read more → @@ -379,7 +379,7 @@ sys 0m1.979s

                                                                    July, 2018

diff --git a/docs/categories/page/3/index.html b/docs/categories/page/3/index.html
index 00939343d..99f31be73 100644
--- a/docs/categories/page/3/index.html
+++ b/docs/categories/page/3/index.html

                                                                    June, 2018

                                                                    @@ -104,7 +104,7 @@
                                                                    • Test the DSpace 5.8 module upgrades from Atmire (#378)
                                                                        -
                                                                      • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
                                                                      • +
                                                                      • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
                                                                    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
                                                                    • @@ -133,7 +133,7 @@ sys 2m7.289s

                                                                      May, 2018

                                                                      @@ -161,14 +161,14 @@ sys 2m7.289s

                                                                      April, 2018

                                                                      2018-04-01

                                                                        -
                                                                      • I tried to test something on DSpace Test but noticed that it's down since god knows when
                                                                      • +
                                                                      • I tried to test something on DSpace Test but noticed that it’s down since god knows when
                                                                      • Catalina logs at least show some memory errors yesterday:
                                                                      Read more → @@ -183,7 +183,7 @@ sys 2m7.289s

                                                                      March, 2018

                                                                      @@ -204,7 +204,7 @@ sys 2m7.289s

                                                                      February, 2018

                                                                      @@ -212,9 +212,9 @@ sys 2m7.289s

                                                                      2018-02-01

                                                                      • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
                                                                      • -
                                                                      • We don't need to distinguish between internal and external works, so that makes it just a simple list
                                                                      • +
                                                                      • We don’t need to distinguish between internal and external works, so that makes it just a simple list
                                                                      • Yesterday I figured out how to monitor DSpace sessions using JMX
                                                                      • -
                                                                      • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
                                                                      • +
                                                                      • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
                                                                      Read more → @@ -228,7 +228,7 @@ sys 2m7.289s

                                                                      January, 2018

                                                                      @@ -236,7 +236,7 @@ sys 2m7.289s

                                                                      2018-01-02

                                                                      • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
                                                                      • -
                                                                      • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
                                                                      • +
                                                                      • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
                                                                      • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
                                                                      • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
                                                                      • And just before that I see this:
                                                                      • @@ -244,8 +244,8 @@ sys 2m7.289s
                                                                        Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
                                                                         
                                                                        • Ah hah! So the pool was actually empty!
                                                                        • -
                                                                        • I need to increase that, let's try to bump it up from 50 to 75
                                                                        • -
                                                                        • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
                                                                        • +
                                                                        • I need to increase that, let’s try to bump it up from 50 to 75
                                                                        • +
                                                                        • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
                                                                        • I notice this error quite a few times in dspace.log:
                                                                        2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
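For reference on the pool bump from 50 to 75 above: where that limit lives depends on the setup. A minimal sketch, assuming either DSpace's own pool or a JNDI pool in Tomcat (property and attribute names from stock DSpace 5.x and Tomcat, not copied from our actual config):

# dspace.cfg, if using DSpace's built-in pool:
db.maxconnections = 75
# server.xml, if the pool is a JNDI <Resource>: raise maxActive="75" (and maxIdle as needed) on that element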
                                                                        @@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
                                                                         dspace.log.2018-01-01:45
                                                                         dspace.log.2018-01-02:34
                                                                         
                                                                          -
                                                                        • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
                                                                        • +
                                                                        • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
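A minimal sketch of the Let's Encrypt route with certbot (the domain names are examples, and the standalone plugin assumes port 80 is free during issuance):

$ certbot certonly --standalone -d example.ilri.org -d other.ilri.org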
                                                                        Read more → @@ -312,7 +312,7 @@ dspace.log.2018-01-02:34

                                                                        December, 2017

                                                                        @@ -336,7 +336,7 @@ dspace.log.2018-01-02:34

                                                                        November, 2017

                                                                        @@ -369,7 +369,7 @@ COPY 54701

                                                                        October, 2017

                                                                        @@ -380,7 +380,7 @@ COPY 54701
                                                                      http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
                                                                       
                                                                        -
                                                                      • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
                                                                      • +
                                                                      • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
                                                                      • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
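On the doubled handle URLs noted above, a first pass at finding them could be a simple LIKE query (column names per the DSpace 5 metadatavalue schema; a sketch, not the exact query I ended up running):

dspace=# SELECT resource_id, text_value FROM metadatavalue WHERE text_value LIKE '%hdl.handle.net%||%hdl.handle.net%';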
                                                                      Read more → @@ -395,10 +395,10 @@ COPY 54701

                                                                      CGIAR Library Migration

diff --git a/docs/categories/page/4/index.html b/docs/categories/page/4/index.html
index 25e851434..aea4a4d1a 100644
--- a/docs/categories/page/4/index.html
+++ b/docs/categories/page/4/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@

                                                                      September, 2017

                                                                      @@ -106,7 +106,7 @@

                                                                    2017-09-07

                                                                      -
                                                                    • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
                                                                    • +
                                                                    • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
                                                                    Read more → @@ -121,7 +121,7 @@

                                                                    August, 2017

                                                                    @@ -139,7 +139,7 @@
                                                                  • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
                                                                  • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
                                                                  • -
                                                                  • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
                                                                  • +
                                                                  • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
                                                                  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
                                                                  • We might actually have to block these requests with HTTP 403 depending on the user agent
                                                                  • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
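If we do go the blocking route mentioned above, a rough nginx sketch would be something like this (the upstream name and the user agent pattern are placeholders, not our production config):

location ~ ^/(discover|browse) {
    if ($http_user_agent ~* "(examplebot|badcrawler)") {
        return 403;
    }
    proxy_pass http://tomcat_http;
}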
                                                                  • @@ -160,7 +160,7 @@

                                                                    July, 2017

                                                                    @@ -171,8 +171,8 @@

                                                                    2017-07-04

                                                                    • Merge changes for WLE Phase II theme rename (#329)
                                                                    • -
                                                                    • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
                                                                    • -
                                                                    • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
                                                                    • +
                                                                    • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
                                                                    • +
                                                                    • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
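The command itself is cut off in this summary; a rough equivalent, using the stock metadatafieldregistry columns and a simplified pair of sed expressions, might look like:

$ psql -x dspace -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry' \
    | sed -e 's/^-\[ RECORD.*/<field>/' -e 's/^\([a-z_]*\) *| \(.*\)/  <\1>\2<\/\1>/'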
                                                                    Read more → @@ -187,11 +187,11 @@

                                                                    June, 2017

- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
@@ -205,11 +205,11 @@

                                                                    May, 2017

- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items that are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSpace Test) tomorrow: https://cgspace.
Read more →
@@ -223,7 +223,7 @@

                                                                    April, 2017

                                                                @@ -252,7 +252,7 @@

                                                                March, 2017

                                                                @@ -270,7 +270,7 @@
                                                              • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
                                                              • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
• Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
                                                              • -
                                                              • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
                                                              • +
                                                              • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
                                                              $ identify ~/Desktop/alc_contrastes_desafios.jpg
                                                               /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
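As a manual check (not the filter-media code path itself), ImageMagick can be told to convert the colorspace while generating the thumbnail, after which identify should report sRGB rather than CMYK; the source PDF name here is assumed to match the thumbnail above:

$ convert alc_contrastes_desafios.pdf\[0\] -colorspace sRGB -thumbnail x600 /tmp/alc_thumb.jpg
$ identify /tmp/alc_thumb.jpg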
                                                              @@ -288,7 +288,7 @@
                                                                   

                                                              February, 2017

                                                              @@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1
                                                        • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
                                                        • -
                                                        • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
                                                        • +
                                                        • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
                                                        Read more → @@ -322,15 +322,15 @@ DELETE 1

                                                        January, 2017

                                                        2017-01-02

                                                        • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
                                                        • -
                                                        • I tested on DSpace Test as well and it doesn't work there either
                                                        • -
                                                        • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
                                                        • +
                                                        • I tested on DSpace Test as well and it doesn’t work there either
                                                        • +
                                                        • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
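For the record, the sharding can also be attempted by hand with DSpace's stats-util, whose -s option splits the statistics core into one core per year (installation path assumed):

$ ~/dspace/bin/dspace stats-util -s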
                                                        Read more → @@ -345,7 +345,7 @@ DELETE 1

                                                        December, 2016

@@ -360,8 +360,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
                                                          -
                                                        • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
                                                        • -
                                                        • I've raised a ticket with Atmire to ask
                                                        • +
                                                        • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
                                                        • +
                                                        • I’ve raised a ticket with Atmire to ask
                                                        • Another worrying error from dspace.log is:
Read more →

diff --git a/docs/categories/page/5/index.html b/docs/categories/page/5/index.html
index 3d7f69049..d411321b3 100644
--- a/docs/categories/page/5/index.html
+++ b/docs/categories/page/5/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,13 +96,13 @@

                                                        November, 2016

                                                        2016-11-01

                                                          -
                                                        • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
                                                        • +
                                                        • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)

                                                        Listings and Reports with output type

                                                        Read more → @@ -118,7 +118,7 @@

                                                        October, 2016

                                                  @@ -131,7 +131,7 @@
                                                • ORCIDs plus normal authors
                                              • -
                                              • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
                                              • +
• I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
                                              0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
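A CSV in that shape would look roughly like this (the id and collection values are made up; only the ORCID column format matters), and it can be applied with DSpace's metadata-import:

$ cat /tmp/orcid-test.csv
id,collection,ORCID:dc.contributor.author
12345,10568/99999,"0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X"
$ ~/dspace/bin/dspace metadata-import -f /tmp/orcid-test.csv -e aorth@mjanja.ch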
                                               
                                              @@ -148,14 +148,14 @@

                                              September, 2016

                                  2016-09-01

                                  • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
                                  • -
                                  • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
                                  • +
                                  • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
                                  • We had been using DC=ILRI to determine whether a user was ILRI or not
                                  • It looks like we might be able to use OUs now, instead of DCs:
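A sketch of checking that from the command line (the host, bind account, and base DN are placeholders rather than the real Active Directory values); the OU components in the returned distinguishedName are what a login filter would match on instead of DC=ILRI:

$ ldapsearch -x -H ldaps://ad.example.org -D "binduser@example.org" -W \
    -b "dc=example,dc=org" "(sAMAccountName=aorth)" distinguishedName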
                                  @@ -174,7 +174,7 @@

                                  August, 2016

            @@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5

            July, 2016

      @@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      June, 2016

    2016-06-01

    @@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    April, 2016

    by Alan Orth in -  + 

    @@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and Read more → @@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    March, 2016

    by Alan Orth in -  + 

    2016-03-02

    Read more → @@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    February, 2016

    by Alan Orth in -  + 

diff --git a/docs/categories/page/6/index.html b/docs/categories/page/6/index.html
index 9982f5817..0ee3d3153 100644
--- a/docs/categories/page/6/index.html
+++ b/docs/categories/page/6/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@

    January, 2016

    by Alan Orth in -  + 

    @@ -119,7 +119,7 @@

    December, 2015

    by Alan Orth in -  + 

    @@ -146,7 +146,7 @@

    November, 2015

    by Alan Orth in -  + 

diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html
index 3b461491c..6edb9b537 100644
--- a/docs/cgiar-library-migration/index.html
+++ b/docs/cgiar-library-migration/index.html
@@ -15,7 +15,7 @@
-
+
@@ -46,7 +46,7 @@
-
+
@@ -93,10 +93,10 @@

    CGIAR Library Migration

    @@ -122,8 +122,8 @@
  • SELECT * FROM pg_stat_activity; seems to show ~6 extra connections used by the command line tools during import
  • -
  • Temporarily disable nightly index-discovery cron job because the import process will be taking place during some of this time and I don't want them to be competing to update the Solr index
  • -
  • Copy HTTPS certificate key pair from CGIAR Library server's Tomcat keystore:
  • +
  • Temporarily disable nightly index-discovery cron job because the import process will be taking place during some of this time and I don’t want them to be competing to update the Solr index
  • +
  • Copy HTTPS certificate key pair from CGIAR Library server’s Tomcat keystore:
  • $ keytool -list -keystore tomcat.keystore
     $ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
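If the pair is then needed in PEM format (for nginx, for example), one way to extract it from the PKCS12 file is the following sketch, not necessarily the exact step used here:

$ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pem
$ openssl pkcs12 -in library.cgiar.org.p12 -nodes -nocerts -out library.cgiar.org.key.pem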
    @@ -172,7 +172,7 @@ $ for item in 10947-2527/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aor
     $ dspace packager -s -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83389 10947-1/10947-1.zip
     $ for collection in 10947-1/COLLECTION@10947-*; do dspace packager -s -o ignoreHandle=false -t AIP -e aorth@mjanja.ch -p 10947/1 $collection; done
     $ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
    -

    This submits AIP hierarchies recursively (-r) and suppresses errors when an item's parent collection hasn't been created yet—for example, if the item is mapped. The large historic archive (10947/1) is created in several steps because it requires a lot of memory and often crashes.

    +

    This submits AIP hierarchies recursively (-r) and suppresses errors when an item’s parent collection hasn’t been created yet—for example, if the item is mapped. The large historic archive (10947/1) is created in several steps because it requires a lot of memory and often crashes.

    Create new subcommunities and collections for content we reorganized into new hierarchies from the original:

    -
  • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • +
• I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     
    @@ -148,14 +148,14 @@

    September, 2016

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • It looks like we might be able to use OUs now, instead of DCs:
    @@ -174,7 +174,7 @@

    August, 2016

    @@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5

    July, 2016

    @@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    June, 2016

    2016-06-01

    • Experimenting with IFPRI OAI (we want to harvest their publications)
    • -
    • After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • +
    • After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
    • This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
    • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
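A quick way to list those sets from the command line (assuming curl and xmllint are installed):

$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListSets' | xmllint --format - | grep -i setspec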
    • @@ -261,7 +261,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      May, 2016

      @@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      April, 2016

      @@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
      • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
      • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
      • -
      • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!
      • -
      • This will save us a few gigs of backup space we're paying for on S3
      • +
• After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!
      • +
      • This will save us a few gigs of backup space we’re paying for on S3
      • Also, I noticed the checker log has some errors we should pay attention to:
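(The checker log excerpt is cut off in this summary.) On the backup point above, a minimal sketch of archiving only dspace.log rather than the whole log folder; the path, destination, and schedule are placeholders:

# crontab entry (sketch): archive only dspace.log, skipping solr, cocoon, checker, etc.
0 4 * * * tar czf /backup/dspace-logs-$(date +\%F).tar.gz /home/dspace/log/dspace.log.*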
      Read more → @@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      March, 2016

      2016-03-02

      • Looking at issues with author authorities on CGSpace
      • -
      • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
      • +
      • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
      • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
      Read more → @@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      February, 2016

diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 5f92fe671..8aa15d50c 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@

      January, 2016

      @@ -119,7 +119,7 @@

      December, 2015

      @@ -146,7 +146,7 @@

      November, 2015

diff --git a/docs/posts/index.html b/docs/posts/index.html
index 4e7d38b1f..08b96f158 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@

      January, 2020

      @@ -132,7 +132,7 @@

      December, 2019

      @@ -164,7 +164,7 @@

      November, 2019

      @@ -183,7 +183,7 @@ 1277694
      • So 4.6 million from XMLUI and another 1.2 million from API requests
      • -
      • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
      • +
      • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
      # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
       1183456 
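Narrowing that same slice of the log to bitstream requests is one more grep (assuming they appear as /rest/bitstreams URLs in the access log):

# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"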
      @@ -202,10 +202,10 @@
         

      CGSpace CG Core v2 Migration

      @@ -223,12 +223,12 @@

      October, 2019

- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -241,7 +241,7 @@

      September, 2019

      @@ -286,14 +286,14 @@

      August, 2019

      2019-08-03

        -
      • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
      • +
      • Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

      2019-08-04

        @@ -301,7 +301,7 @@
      • Run system updates on CGSpace (linode18) and reboot it
        • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
        • -
        • After rebooting, all statistics cores were loaded… wow, that's lucky.
        • +
        • After rebooting, all statistics cores were loaded… wow, that’s lucky.
      • Run system updates on DSpace Test (linode19) and reboot it
      • @@ -318,7 +318,7 @@

        July, 2019

        @@ -346,7 +346,7 @@

        June, 2019

        @@ -372,7 +372,7 @@

        May, 2019

        diff --git a/docs/posts/index.xml b/docs/posts/index.xml index a5abc1a10..2f06a2baa 100644 --- a/docs/posts/index.xml +++ b/docs/posts/index.xml @@ -82,7 +82,7 @@ 1277694 </code></pre><ul> <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> -<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> +<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> </ul> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot; 1183456 @@ -107,7 +107,7 @@ Tue, 01 Oct 2019 13:20:51 +0300 https://alanorth.github.io/cgspace-notes/2019-10/ - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc. @@ -154,7 +154,7 @@ https://alanorth.github.io/cgspace-notes/2019-08/ <h2 id="2019-08-03">2019-08-03</h2> <ul> -<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> +<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li> </ul> <h2 id="2019-08-04">2019-08-04</h2> <ul> @@ -162,7 +162,7 @@ <li>Run system updates on CGSpace (linode18) and reboot it <ul> <li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li> -<li>After rebooting, all statistics cores were loaded&hellip; wow, that's lucky.</li> +<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li> </ul> </li> <li>Run system updates on DSpace Test (linode19) and reboot it</li> @@ -269,9 +269,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace https://alanorth.github.io/cgspace-notes/2019-03/ <h2 id="2019-03-01">2019-03-01</h2> <ul> -<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> +<li>I checked IITA&rsquo;s 259 Feb 14 records from last month for duplicates using Atmire&rsquo;s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li> <li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, 
etc&hellip;</li> -<li>Looking at the other half of Udana's WLE records from 2018-11 +<li>Looking at the other half of Udana&rsquo;s WLE records from 2018-11 <ul> <li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li> <li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li> @@ -329,7 +329,7 @@ sys 0m1.979s <h2 id="2019-01-02">2019-01-02</h2> <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li> -<li>I don't see anything interesting in the web server logs around that time though:</li> +<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 @@ -390,7 +390,7 @@ sys 0m1.979s <h2 id="2018-10-01">2018-10-01</h2> <ul> <li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li> -<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li> +<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I&rsquo;m super busy in Nairobi right now</li> </ul> @@ -403,9 +403,9 @@ sys 0m1.979s <h2 id="2018-09-02">2018-09-02</h2> <ul> <li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li> -<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> -<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li> -<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li> +<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li> +<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li> +<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li> </ul> @@ -424,10 +424,10 @@ sys 0m1.979s [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB </code></pre><ul> <li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li> -<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li> -<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError&hellip;</li> +<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the 
<code>java</code> process that was OOM killed above was Tomcat&rsquo;s</li> +<li>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</li> <li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li> -<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li> +<li>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</li> <li>I ran all system updates on DSpace Test and rebooted it</li> </ul> @@ -460,7 +460,7 @@ sys 0m1.979s <ul> <li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>) <ul> -<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li> +<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li> </ul> </li> <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li> @@ -506,7 +506,7 @@ sys 2m7.289s https://alanorth.github.io/cgspace-notes/2018-04/ <h2 id="2018-04-01">2018-04-01</h2> <ul> -<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li> +<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li> <li>Catalina logs at least show some memory errors yesterday:</li> </ul> @@ -532,9 +532,9 @@ sys 2m7.289s <h2 id="2018-02-01">2018-02-01</h2> <ul> <li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li> -<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li> +<li>We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list</li> <li>Yesterday I figured out how to monitor DSpace sessions using JMX</li> -<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> +<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu&rsquo;s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li> </ul> @@ -547,7 +547,7 @@ sys 2m7.289s <h2 id="2018-01-02">2018-01-02</h2> <ul> <li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li> -<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li> +<li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li> <li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li> <li>In dspace.log around that time I see many errors like &ldquo;Client closed the 
connection before file download was complete&rdquo;</li> <li>And just before that I see this:</li> @@ -555,8 +555,8 @@ sys 2m7.289s <pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]. </code></pre><ul> <li>Ah hah! So the pool was actually empty!</li> -<li>I need to increase that, let's try to bump it up from 50 to 75</li> -<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li> +<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li> +<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li> <li>I notice this error quite a few times in dspace.log:</li> </ul> <pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets @@ -609,7 +609,7 @@ dspace.log.2017-12-31:53 dspace.log.2018-01-01:45 dspace.log.2018-01-02:34 </code></pre><ul> -<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li> +<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</li> </ul> @@ -664,7 +664,7 @@ COPY 54701 </ul> <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336 </code></pre><ul> -<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> +<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li> <li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li> </ul> @@ -690,7 +690,7 @@ COPY 54701 </ul> <h2 id="2017-09-07">2017-09-07</h2> <ul> -<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li> +<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group</li> </ul> @@ -714,7 +714,7 @@ COPY 54701 </li> <li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li> <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> -<li>It turns out that we're already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> +<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li> <li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li> <li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only 
had ~100 entries, instead of 2415</li> @@ -737,8 +737,8 @@ COPY 54701 <h2 id="2017-07-04">2017-07-04</h2> <ul> <li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li> -<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li> -<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> +<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li> +<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> </ul> @@ -748,7 +748,7 @@ COPY 54701 Thu, 01 Jun 2017 10:14:52 +0300 https://alanorth.github.io/cgspace-notes/2017-06/ - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. @@ -757,7 +757,7 @@ COPY 54701 Mon, 01 May 2017 16:21:52 +0200 https://alanorth.github.io/cgspace-notes/2017-05/ - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. 
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. @@ -800,7 +800,7 @@ COPY 54701 <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> -<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li> +<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li> </ul> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 @@ -828,7 +828,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 </code></pre><ul> <li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> -<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> +<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> </ul> @@ -841,8 +841,8 @@ DELETE 1 <h2 id="2017-01-02">2017-01-02</h2> <ul> <li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li> -<li>I tested on DSpace Test as well and it doesn't work there either</li> -<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li> +<li>I tested on DSpace Test as well and it doesn&rsquo;t work there either</li> +<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li> </ul> @@ -863,8 +863,8 @@ DELETE 1 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ 
BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) </code></pre><ul> -<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li> -<li>I've raised a ticket with Atmire to ask</li> +<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> +<li>I&rsquo;ve raised a ticket with Atmire to ask</li> <li>Another worrying error from dspace.log is:</li> </ul> @@ -877,7 +877,7 @@ DELETE 1 https://alanorth.github.io/cgspace-notes/2016-11/ <h2 id="2016-11-01">2016-11-01</h2> <ul> -<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> +<li>Add <code>dc.type</code> to the output options for Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p> @@ -897,7 +897,7 @@ DELETE 1 <li>ORCIDs plus normal authors</li> </ul> </li> -<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> +<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> </ul> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X </code></pre> @@ -912,7 +912,7 @@ DELETE 1 <h2 id="2016-09-01">2016-09-01</h2> <ul> <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> -<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li> +<li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> <li>It looks like we might be able to use OUs now, instead of DCs:</li> </ul> @@ -972,7 +972,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2016-06-01">2016-06-01</h2> <ul> <li>Experimenting with IFPRI OAI (we want to harvest their publications)</li> -<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> +<li>After reading the <a 
href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> <li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li> <li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li> <li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li> @@ -1007,8 +1007,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <ul> <li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li> <li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li> -<li>After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!</li> -<li>This will save us a few gigs of backup space we're paying for on S3</li> +<li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, leave alone one from last year!</li> +<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li> <li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li> </ul> @@ -1022,7 +1022,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2016-03-02">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> -<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li> +<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li> </ul> diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 8f6375717..931ad202b 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -95,7 +95,7 @@

        April, 2019

        @@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

        March, 2019

        2019-03-01

          -
        • I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
        • +
        • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
        • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
        • -
        • Looking at the other half of Udana's WLE records from 2018-11 +
        • Looking at the other half of Udana’s WLE records from 2018-11
          • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
          • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
          • @@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

            February, 2019

            @@ -213,7 +213,7 @@ sys 0m1.979s

            January, 2019

            @@ -221,7 +221,7 @@ sys 0m1.979s

            2019-01-02

            • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
            • -
            • I don't see anything interesting in the web server logs around that time though:
            • +
            • I don’t see anything interesting in the web server logs around that time though:
            # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                  92 40.77.167.4
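As a follow-up, a quick sketch for seeing which user agents were busiest in the same window, assuming the default nginx "combined" log format where the user agent is the sixth double-quoted field:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk -F'"' '{print $6}' | sort | uniq -c | sort -n | tail -n 10
```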
            @@ -247,7 +247,7 @@ sys     0m1.979s
               

            December, 2018

            @@ -274,7 +274,7 @@ sys 0m1.979s

            November, 2018

            @@ -301,7 +301,7 @@ sys 0m1.979s

            October, 2018

            @@ -309,7 +309,7 @@ sys 0m1.979s

            2018-10-01

            • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
            • -
            • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
            • +
            • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
            Read more → @@ -323,7 +323,7 @@ sys 0m1.979s

            September, 2018

            @@ -331,9 +331,9 @@ sys 0m1.979s

            2018-09-02

            • New PostgreSQL JDBC driver version 42.2.5
            • -
            • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
            • -
            • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
            • -
            • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
            • +
            • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
            • +
            • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
            • +
            • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
            Read more → @@ -347,7 +347,7 @@ sys 0m1.979s

            August, 2018

            @@ -361,10 +361,10 @@ sys 0m1.979s [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
            • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
            • -
            • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
            • -
            • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
            • +
            • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
            • +
            • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
            • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
            • -
            • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
            • +
            • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
            • I ran all system updates on DSpace Test and rebooted it
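For reference, a minimal sketch of what the heap bump would look like on a Debian/Ubuntu-style Tomcat 7 install, assuming the JVM options live in /etc/default/tomcat7 (the file location and the other flags here are assumptions, not what is actually on the server):

```
$ grep JAVA_OPTS /etc/default/tomcat7
JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m -Dfile.encoding=UTF-8"
```

Tomcat needs a restart after changing this for the new heap size to take effect.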
            Read more → @@ -379,7 +379,7 @@ sys 0m1.979s

            July, 2018

            diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 0b049ac8d..63c6e429f 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -95,7 +95,7 @@

            June, 2018

            @@ -104,7 +104,7 @@
            • Test the DSpace 5.8 module upgrades from Atmire (#378)
                -
              • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
              • +
              • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
            • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
            • @@ -133,7 +133,7 @@ sys 2m7.289s

              May, 2018

              @@ -161,14 +161,14 @@ sys 2m7.289s

              April, 2018

              2018-04-01

                -
              • I tried to test something on DSpace Test but noticed that it's down since god knows when
              • +
              • I tried to test something on DSpace Test but noticed that it’s down since god knows when
              • Catalina logs at least show some memory errors yesterday:
              Read more → @@ -183,7 +183,7 @@ sys 2m7.289s

              March, 2018

              @@ -204,7 +204,7 @@ sys 2m7.289s

              February, 2018

              @@ -212,9 +212,9 @@ sys 2m7.289s

              2018-02-01

              • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
              • -
              • We don't need to distinguish between internal and external works, so that makes it just a simple list
              • +
              • We don’t need to distinguish between internal and external works, so that makes it just a simple list
              • Yesterday I figured out how to monitor DSpace sessions using JMX
              • -
              • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
              • +
              • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
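The JMX part is just a few system properties on the Tomcat JVM; a minimal sketch, assuming a setenv.sh or /etc/default/tomcat7 style file and an arbitrary port, kept local-only so that only the munin plugin on the same host can reach it:

```
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=true \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```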
              Read more → @@ -228,7 +228,7 @@ sys 2m7.289s

              January, 2018

              @@ -236,7 +236,7 @@ sys 2m7.289s

              2018-01-02

              • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
              • -
              • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
              • +
              • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
              • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
              • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
              • And just before that I see this:
              • @@ -244,8 +244,8 @@ sys 2m7.289s
                Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
                 
                • Ah hah! So the pool was actually empty!
                • -
                • I need to increase that, let's try to bump it up from 50 to 75
                • -
                • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
                • +
                • I need to increase that, let’s try to bump it up from 50 to 75
                • +
                • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
                • I notice this error quite a few times in dspace.log:
                2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
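Both the pool exhaustion and the sidebar facet error come down to database connections, so a quick sketch for seeing who is holding them, assuming PostgreSQL is local and the database is called dspace:

```
$ psql -h localhost -U dspace dspace -c 'SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC;'
```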
                @@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
                 dspace.log.2018-01-01:45
                 dspace.log.2018-01-02:34
                 
                  -
                • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
                • +
                • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
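For the record, the Let's Encrypt idea was just a certificate with one -d per hostname instead of a wildcard; a rough sketch with placeholder hostnames:

```
$ certbot certonly --standalone -d ilri.org -d www.ilri.org
```

(--standalone needs port 80 free, so --webroot against the existing web server's docroot is probably more practical on a live server.)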
                Read more → @@ -312,7 +312,7 @@ dspace.log.2018-01-02:34

                December, 2017

                @@ -336,7 +336,7 @@ dspace.log.2018-01-02:34

                November, 2017

                @@ -369,7 +369,7 @@ COPY 54701

                October, 2017

                @@ -380,7 +380,7 @@ COPY 54701
              http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
               
                -
              • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
              • +
              • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
              • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
              Read more → @@ -395,10 +395,10 @@ COPY 54701

              CGIAR Library Migration

              diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 8739170b4..e2c05d378 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -96,7 +96,7 @@

              September, 2017

              @@ -106,7 +106,7 @@

            2017-09-07

              -
            • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
            • +
            • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
            Read more → @@ -121,7 +121,7 @@

            August, 2017

            @@ -139,7 +139,7 @@
          • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
          • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
          • -
          • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
          • +
          • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
          • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
          • We might actually have to block these requests with HTTP 403 depending on the user agent
          • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
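Related to the X-Robots-Tag note above, a quick sketch for checking what headers a crawler actually gets on those pages (nothing assumed beyond the public site):

```
$ curl -s -o /dev/null -D - 'https://cgspace.cgiar.org/discover' | grep -iE 'http/|x-robots-tag'
```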
          • @@ -160,7 +160,7 @@

            July, 2017

            @@ -171,8 +171,8 @@

            2017-07-04

            • Merge changes for WLE Phase II theme rename (#329)
            • -
            • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
            • -
            • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
            • +
            • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
            • +
            • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
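Roughly what that looks like, as a sketch; the database name and the exact sed expressions here are illustrative, not the real ones:

```
$ psql -x -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry' mel_dspace \
    | sed -E -e 's/^-\[ RECORD [0-9]+ \].*$//' \
             -e 's/^([a-z_]+) +\| (.*)$/<\1>\2<\/\1>/'
```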
            Read more → @@ -187,11 +187,11 @@

            June, 2017

            - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -205,11 +205,11 @@

            May, 2017

            - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -223,7 +223,7 @@

            April, 2017

        @@ -252,7 +252,7 @@

        March, 2017

        @@ -270,7 +270,7 @@
      • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
      • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
• Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
      • -
      • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
      • +
      • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
      $ identify ~/Desktop/alc_contrastes_desafios.jpg
       /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
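A quick way to test the colorspace theory by hand, with placeholder filenames (the real fix would need equivalent settings on the DSpace filter side):

```
$ convert alc_contrastes_desafios.pdf\[0\] -colorspace sRGB -thumbnail 600x600 /tmp/test-srgb.jpg
$ identify /tmp/test-srgb.jpg
```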
      @@ -288,7 +288,7 @@
           

      February, 2017

      @@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1
      • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
      • -
      • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
      • +
      • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
      Read more → @@ -322,15 +322,15 @@ DELETE 1

      January, 2017

      2017-01-02

      • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
      • -
      • I tested on DSpace Test as well and it doesn't work there either
      • -
      • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
      • +
      • I tested on DSpace Test as well and it doesn’t work there either
      • +
      • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
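To reproduce the error outside of cron, the sharding can be run by hand with the stock statistics utility (assuming the dspace launcher is on the PATH; whether the scheduled task calls exactly this is an assumption):

```
$ dspace stats-util -s
```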
      Read more → @@ -345,7 +345,7 @@ DELETE 1

      December, 2016

      @@ -360,8 +360,8 @@ DELETE 1 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607") 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
        -
      • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
      • -
      • I've raised a ticket with Atmire to ask
      • +
      • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
      • +
      • I’ve raised a ticket with Atmire to ask
      • Another worrying error from dspace.log is:
      Read more → diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 4a5c3b17f..b145e5997 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -96,13 +96,13 @@

      November, 2016

      2016-11-01

        -
      • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
      • +
      • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)

      Listings and Reports with output type

      Read more → @@ -118,7 +118,7 @@

      October, 2016

      @@ -131,7 +131,7 @@
    • ORCIDs plus normal authors
    -
• I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • +
• I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
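To actually apply a test CSV like that, the batch metadata import would look roughly like this; the item id, collection handle, and eperson address are placeholders and the flags are from memory of the DSpace 5 CLI, so treat it as a sketch:

```
$ cat /tmp/orcid-test.csv
id,collection,ORCID:dc.contributor.author
12345,10568/100,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
$ dspace metadata-import -f /tmp/orcid-test.csv -e aorth@example.org
```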
     
    @@ -148,14 +148,14 @@

    September, 2016

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • It looks like we might be able to use OUs now, instead of DCs:
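A sketch of how an OU-based check might look with plain ldapsearch; the server, bind DN, base, and account name are all placeholders, and the point is just to read the OU components out of distinguishedName:

```
$ ldapsearch -x -H ldaps://ad.example.org -D 'binduser@example.org' -W \
    -b 'dc=example,dc=org' '(sAMAccountName=someuser)' distinguishedName memberOf
```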
    @@ -174,7 +174,7 @@

    August, 2016

    @@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5

    July, 2016

    @@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    June, 2016

    2016-06-01

    • Experimenting with IFPRI OAI (we want to harvest their publications)
    • -
    • After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • +
    • After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
    • This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
    • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
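Both verbs are easy to poke at from the command line, for example (xmllint from libxml2 is only there for pretty-printing):

```
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListSets' | xmllint --format -
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc' | xmllint --format -
```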
    • @@ -261,7 +261,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      May, 2016

      @@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      April, 2016

      @@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
      • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
      • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
      • -
      • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!
      • -
      • This will save us a few gigs of backup space we're paying for on S3
      • +
      • After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!
      • +
      • This will save us a few gigs of backup space we’re paying for on S3
      • Also, I noticed the checker log has some errors we should pay attention to:
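A rough way to pull just the problem lines out of it; the log path is an assumption about this server's layout and the patterns are a guess at what the checksum checker writes:

```
$ grep -iE 'error|not.?found|no.?match' /home/dspace/log/checker.log | tail -n 20
```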
      Read more → @@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      March, 2016

      2016-03-02

      • Looking at issues with author authorities on CGSpace
      • -
      • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
      • +
      • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
      • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
      Read more → @@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      February, 2016

      diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 9272c6fec..dcef508ce 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -96,7 +96,7 @@

      January, 2016

      @@ -119,7 +119,7 @@

      December, 2015

      @@ -146,7 +146,7 @@

      November, 2015

      diff --git a/docs/tags/index.html b/docs/tags/index.html index 2401a8f18..8d653d639 100644 --- a/docs/tags/index.html +++ b/docs/tags/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -95,7 +95,7 @@

      January, 2020

      @@ -132,7 +132,7 @@

      December, 2019

      @@ -164,7 +164,7 @@

      November, 2019

      @@ -183,7 +183,7 @@ 1277694
      • So 4.6 million from XMLUI and another 1.2 million from API requests
      • -
      • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
      • +
      • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
      # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
       1183456 
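And narrowing that down to just the bitstream retrievals, assuming they show up with /rest/bitstreams/ in the request path:

```
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c '/rest/bitstreams/'
```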
      @@ -202,10 +202,10 @@
         

      CGSpace CG Core v2 Migration

      @@ -223,12 +223,12 @@

      October, 2019

      - 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc. + 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc. Read more → @@ -241,7 +241,7 @@
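The csvcut command above is truncated in this excerpt; as an illustration of the same idea, with hypothetical column names after id rather than the real export headers:

```
$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US]' /tmp/iwmi.csv > /tmp/iwmi-check.csv
```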

      September, 2019

      @@ -286,14 +286,14 @@

      August, 2019

      2019-08-03

        -
      • Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
      • +
      • Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

      2019-08-04

        @@ -301,7 +301,7 @@
      • Run system updates on CGSpace (linode18) and reboot it
        • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
        • -
        • After rebooting, all statistics cores were loaded… wow, that's lucky.
        • +
        • After rebooting, all statistics cores were loaded… wow, that’s lucky.
      • Run system updates on DSpace Test (linode19) and reboot it
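The core check itself can be done against Solr's CoreAdmin API; the port and path are assumptions about how Solr is deployed behind Tomcat here:

```
$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | python -m json.tool | grep '"name"'
```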
      • @@ -318,7 +318,7 @@

        July, 2019

        @@ -346,7 +346,7 @@

        June, 2019

        @@ -372,7 +372,7 @@

        May, 2019

        diff --git a/docs/tags/migration/index.html b/docs/tags/migration/index.html index 7449b80c1..11d1f0eb0 100644 --- a/docs/tags/migration/index.html +++ b/docs/tags/migration/index.html @@ -14,7 +14,7 @@ - + @@ -28,7 +28,7 @@ - + @@ -80,10 +80,10 @@

        CGSpace CG Core v2 Migration

        @@ -101,10 +101,10 @@

        CGIAR Library Migration

        diff --git a/docs/tags/notes/index.html b/docs/tags/notes/index.html index ff5f2be40..076a27ea8 100644 --- a/docs/tags/notes/index.html +++ b/docs/tags/notes/index.html @@ -14,7 +14,7 @@ - + @@ -28,7 +28,7 @@ - + @@ -81,7 +81,7 @@

        September, 2017

        @@ -91,7 +91,7 @@

      2017-09-07

        -
      • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
      • +
      • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
      Read more → @@ -106,7 +106,7 @@

      August, 2017

      @@ -124,7 +124,7 @@
    • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
    • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
    • -
    • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
    • +
    • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
    • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
    • We might actually have to block these requests with HTTP 403 depending on the user agent
    • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
    • @@ -145,7 +145,7 @@

      July, 2017

      @@ -156,8 +156,8 @@

      2017-07-04

      • Merge changes for WLE Phase II theme rename (#329)
      • -
      • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
      • -
      • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
      • +
      • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
      • +
      • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
      Read more → @@ -172,11 +172,11 @@

      June, 2017

      - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -190,11 +190,11 @@

      May, 2017

      - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -208,7 +208,7 @@

      April, 2017

      @@ -237,7 +237,7 @@

      March, 2017

      @@ -255,7 +255,7 @@
    • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
    • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
• Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
    • -
    • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    • +
    • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    @@ -273,7 +273,7 @@
         

    February, 2017

    @@ -292,7 +292,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1
    • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
    • -
    • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
    • +
    • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
    Read more → @@ -307,15 +307,15 @@ DELETE 1

    January, 2017

    2017-01-02

    • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
    • -
    • I tested on DSpace Test as well and it doesn't work there either
    • -
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
    • +
    • I tested on DSpace Test as well and it doesn’t work there either
    • +
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
    Read more → @@ -330,7 +330,7 @@ DELETE 1

    December, 2016

    @@ -345,8 +345,8 @@ DELETE 1 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607") 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
      -
    • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
    • -
    • I've raised a ticket with Atmire to ask
    • +
    • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
    • +
    • I’ve raised a ticket with Atmire to ask
    • Another worrying error from dspace.log is:
    Read more → diff --git a/docs/tags/notes/index.xml b/docs/tags/notes/index.xml index d1210445e..fb1a87cc0 100644 --- a/docs/tags/notes/index.xml +++ b/docs/tags/notes/index.xml @@ -23,7 +23,7 @@ </ul> <h2 id="2017-09-07">2017-09-07</h2> <ul> -<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li> +<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group</li> </ul> @@ -47,7 +47,7 @@ </li> <li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li> <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> -<li>It turns out that we're already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> +<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li> <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li> <li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li> <li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li> @@ -70,8 +70,8 @@ <h2 id="2017-07-04">2017-07-04</h2> <ul> <li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li> -<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li> -<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> +<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li> +<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li> </ul> @@ -81,7 +81,7 @@ Thu, 01 Jun 2017 10:14:52 +0300 https://alanorth.github.io/cgspace-notes/2017-06/ - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. 
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. @@ -90,7 +90,7 @@ Mon, 01 May 2017 16:21:52 +0200 https://alanorth.github.io/cgspace-notes/2017-05/ - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. 
@@ -133,7 +133,7 @@ <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> -<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li> +<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li> </ul> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 @@ -161,7 +161,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 </code></pre><ul> <li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> -<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> +<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> </ul> @@ -174,8 +174,8 @@ DELETE 1 <h2 id="2017-01-02">2017-01-02</h2> <ul> <li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li> -<li>I tested on DSpace Test as well and it doesn't work there either</li> -<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li> +<li>I tested on DSpace Test as well and it doesn&rsquo;t work there either</li> +<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li> </ul> @@ -196,8 +196,8 @@ DELETE 1 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) </code></pre><ul> -<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li> -<li>I've raised a ticket with Atmire to ask</li> +<li>I see thousands of them in the logs for the last few months, so 
it&rsquo;s not related to the DSpace 5.5 upgrade</li> +<li>I&rsquo;ve raised a ticket with Atmire to ask</li> <li>Another worrying error from dspace.log is:</li> </ul> @@ -210,7 +210,7 @@ DELETE 1 https://alanorth.github.io/cgspace-notes/2016-11/ <h2 id="2016-11-01">2016-11-01</h2> <ul> -<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> +<li>Add <code>dc.type</code> to the output options for Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p> @@ -230,7 +230,7 @@ DELETE 1 <li>ORCIDs plus normal authors</li> </ul> </li> -<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> +<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> </ul> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X </code></pre> @@ -245,7 +245,7 @@ DELETE 1 <h2 id="2016-09-01">2016-09-01</h2> <ul> <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> -<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li> +<li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> <li>It looks like we might be able to use OUs now, instead of DCs:</li> </ul> @@ -305,7 +305,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2016-06-01">2016-06-01</h2> <ul> <li>Experimenting with IFPRI OAI (we want to harvest their publications)</li> -<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> +<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> <li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li> <li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li> <li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li> @@ -340,8 +340,8 @@ dspacetest=# select text_value from metadatavalue 
where metadata_field_id=3 and <ul> <li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li> <li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li> -<li>After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!</li> -<li>This will save us a few gigs of backup space we're paying for on S3</li> +<li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, leave alone one from last year!</li> +<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li> <li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li> </ul> @@ -355,7 +355,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and <h2 id="2016-03-02">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> -<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li> +<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li> </ul> diff --git a/docs/tags/notes/page/2/index.html b/docs/tags/notes/page/2/index.html index ee46ae5f9..7e0f55936 100644 --- a/docs/tags/notes/page/2/index.html +++ b/docs/tags/notes/page/2/index.html @@ -14,7 +14,7 @@ - + @@ -28,7 +28,7 @@ - + @@ -81,13 +81,13 @@

    November, 2016

    2016-11-01

      -
    • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
    • +
    • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)

    Listings and Reports with output type

    Read more → @@ -103,7 +103,7 @@

    October, 2016

    @@ -116,7 +116,7 @@
  • ORCIDs plus normal authors
  • -
• I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • +
• I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
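The CSV-based ORCID tagging described in the October 2016 entry above could look roughly like the sketch below, using DSpace's stock metadata-export/metadata-import launcher commands; the handle, item id, collection, and email are made-up examples, not values from the original notes.

```
# export one item's metadata to CSV (handle and path are illustrative)
$ dspace metadata-export -i 10568/12345 -f /tmp/item.csv

# keep only the id and collection columns, then add the ORCID column by hand;
# the resulting file would look something like this (ids and handle are invented):
$ cat /tmp/item.csv
id,collection,ORCID:dc.contributor.author
74321,10568/103,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X

# apply the edited CSV (the -e eperson flag may be required depending on version)
$ dspace metadata-import -f /tmp/item.csv -e admin@example.org
```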
     
    @@ -133,14 +133,14 @@

    September, 2016

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • It looks like we might be able to use OUs now, instead of DCs:
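A quick way to see what OU-based filtering would look like is an ldapsearch against the directory and an inspection of the OUs in the returned DN; this is only a sketch, and the host, bind DN, search base, and account name are placeholders rather than values from the original notes.

```
# look a user up by sAMAccountName and check which OUs appear in the DN
$ ldapsearch -x -H ldaps://ad.example.org:3269 \
    -D "binduser@example.org" -W \
    -b "dc=example,dc=org" "(sAMAccountName=someuser)"
```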
    @@ -159,7 +159,7 @@

    August, 2016

    @@ -189,7 +189,7 @@ $ git rebase -i dspace-5.5

    July, 2016

    @@ -220,14 +220,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    June, 2016

    2016-06-01

    • Experimenting with IFPRI OAI (we want to harvest their publications)
    • -
    • After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • +
    • After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
    • After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
    • This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
    • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
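The OAI-PMH flow in the June 2016 entry above can be exercised with plain curl; the endpoint, set, and date come from the entry itself, while the use of curl and xmllint is just an illustration.

```
# list the sets available on IFPRI's ContentDM OAI endpoint
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListSets' | xmllint --format -

# harvest Dublin Core records from the publications set since 2016-01-01
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc' > ifpri-publications.xml
```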
    • @@ -246,7 +246,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      May, 2016

      @@ -272,7 +272,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      April, 2016

      @@ -280,8 +280,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
      • Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
      • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
      • -
      • After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!
      • -
      • This will save us a few gigs of backup space we're paying for on S3
      • +
      • After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!
      • +
      • This will save us a few gigs of backup space we’re paying for on S3
      • Also, I noticed the checker log has some errors we should pay attention to:
      Read more → @@ -297,14 +297,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
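For context on the checker log mentioned in the April 2016 entry above: that log is written by DSpace's checksum checker, which can also be run by hand. A minimal sketch, with the flag and log path from memory rather than the original notes, so worth confirming against the launcher's help output first:

```
# run a single pass of the checksum checker (path and flag are illustrative)
$ dspace checker -l

# then skim the resulting log for bitstreams whose checksums no longer match
$ tail -n 50 /path/to/dspace/log/checker.log
```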

      March, 2016

      2016-03-02

      • Looking at issues with author authorities on CGSpace
      • -
      • For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
      • +
      • For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
      • Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
      Read more → @@ -320,7 +320,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

      February, 2016

      diff --git a/docs/tags/notes/page/3/index.html b/docs/tags/notes/page/3/index.html index 2259afadf..a8d537694 100644 --- a/docs/tags/notes/page/3/index.html +++ b/docs/tags/notes/page/3/index.html @@ -14,7 +14,7 @@ - + @@ -28,7 +28,7 @@ - + @@ -81,7 +81,7 @@

      January, 2016

      @@ -104,7 +104,7 @@

      December, 2015

      @@ -131,7 +131,7 @@

      November, 2015

      diff --git a/docs/tags/page/2/index.html b/docs/tags/page/2/index.html index 772ff68d2..9b88ad184 100644 --- a/docs/tags/page/2/index.html +++ b/docs/tags/page/2/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -95,7 +95,7 @@

      April, 2019

      @@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

      March, 2019

      2019-03-01

        -
      • I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
      • +
      • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
      • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
      • -
      • Looking at the other half of Udana's WLE records from 2018-11 +
      • Looking at the other half of Udana’s WLE records from 2018-11
        • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
        • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
        • @@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

          February, 2019

          @@ -213,7 +213,7 @@ sys 0m1.979s

          January, 2019

          @@ -221,7 +221,7 @@ sys 0m1.979s

          2019-01-02

          • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
          • -
          • I don't see anything interesting in the web server logs around that time though:
          • +
          • I don’t see anything interesting in the web server logs around that time though:
          # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
                92 40.77.167.4
          @@ -247,7 +247,7 @@ sys     0m1.979s
             

          December, 2018

          @@ -274,7 +274,7 @@ sys 0m1.979s

          November, 2018

          @@ -301,7 +301,7 @@ sys 0m1.979s

          October, 2018

          @@ -309,7 +309,7 @@ sys 0m1.979s

          2018-10-01

          • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
          • -
          • I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
          • +
          • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
          Read more → @@ -323,7 +323,7 @@ sys 0m1.979s

          September, 2018

          @@ -331,9 +331,9 @@ sys 0m1.979s

          2018-09-02

          • New PostgreSQL JDBC driver version 42.2.5
          • -
          • I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
          • -
          • Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
          • -
          • I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
          • +
          • I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
          • +
          • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
          • +
          • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
          Read more → @@ -347,7 +347,7 @@ sys 0m1.979s

          August, 2018

          @@ -361,10 +361,10 @@ sys 0m1.979s [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
          • Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
          • -
          • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
          • -
          • I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
          • +
          • From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
          • +
          • I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
          • Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
          • -
          • The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
          • +
          • The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
          • I ran all system updates on DSpace Test and rebooted it
          Read more → @@ -379,7 +379,7 @@ sys 0m1.979s
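Raising the Tomcat JVM heap, as discussed in the August 2018 entry above, comes down to the standard -Xms/-Xmx flags; the file location and surrounding options below are illustrative, not taken from the actual Ansible templates.

```
# e.g. in /etc/default/tomcat7, or wherever JAVA_OPTS is set for Tomcat
JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC"
```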

          July, 2018

          diff --git a/docs/tags/page/3/index.html b/docs/tags/page/3/index.html index 8f124aa5e..8829a83b3 100644 --- a/docs/tags/page/3/index.html +++ b/docs/tags/page/3/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -95,7 +95,7 @@

          June, 2018

          @@ -104,7 +104,7 @@
          • Test the DSpace 5.8 module upgrades from Atmire (#378)
              -
            • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
            • +
            • There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
          • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
          • @@ -133,7 +133,7 @@ sys 2m7.289s

            May, 2018

            @@ -161,14 +161,14 @@ sys 2m7.289s

            April, 2018

            2018-04-01

              -
            • I tried to test something on DSpace Test but noticed that it's down since god knows when
            • +
            • I tried to test something on DSpace Test but noticed that it’s down since god knows when
            • Catalina logs at least show some memory errors yesterday:
            Read more → @@ -183,7 +183,7 @@ sys 2m7.289s

            March, 2018

            @@ -204,7 +204,7 @@ sys 2m7.289s

            February, 2018

            @@ -212,9 +212,9 @@ sys 2m7.289s

            2018-02-01

            • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
            • -
            • We don't need to distinguish between internal and external works, so that makes it just a simple list
            • +
            • We don’t need to distinguish between internal and external works, so that makes it just a simple list
            • Yesterday I figured out how to monitor DSpace sessions using JMX
            • -
            • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
            • +
            • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
            Read more → @@ -228,7 +228,7 @@ sys 2m7.289s
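For the JMX session monitoring mentioned in the February 2018 entry above, Tomcat has to be started with the standard JMX remote system properties; a minimal sketch only — the port and the choice to disable authentication and SSL are illustrative and assume the port is not exposed publicly.

```
# appended to Tomcat's startup options, e.g. via CATALINA_OPTS
CATALINA_OPTS="$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=true \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```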

            January, 2018

            @@ -236,7 +236,7 @@ sys 2m7.289s

            2018-01-02

            • Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
            • -
            • I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
            • +
            • I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
            • The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
            • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
            • And just before that I see this:
            • @@ -244,8 +244,8 @@ sys 2m7.289s
              Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
               
              • Ah hah! So the pool was actually empty!
              • -
              • I need to increase that, let's try to bump it up from 50 to 75
              • -
              • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
              • +
              • I need to increase that, let’s try to bump it up from 50 to 75
              • +
              • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
              • I notice this error quite a few times in dspace.log:
              2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
              @@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
               dspace.log.2018-01-01:45
               dspace.log.2018-01-02:34
               
                -
              • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
              • +
              • Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
              Read more → @@ -312,7 +312,7 @@ dspace.log.2018-01-02:34
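Bumping the database pool from 50 to 75, as described in the January 2018 entry above, happens wherever the pool is defined; the sketch below assumes the simple case where DSpace manages the pool itself via dspace.cfg (the property name is the stock DSpace 5.x one as I recall it; if the pool is instead a Tomcat JNDI resource, the equivalent attribute is maxActive in server.xml).

```
# in [dspace]/config/dspace.cfg (path illustrative)
db.maxconnections = 75

# restart Tomcat afterwards so the new pool size takes effect
$ sudo systemctl restart tomcat7
```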

              December, 2017

              @@ -336,7 +336,7 @@ dspace.log.2018-01-02:34

              November, 2017

              @@ -369,7 +369,7 @@ COPY 54701

              October, 2017

              @@ -380,7 +380,7 @@ COPY 54701
            http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
             
              -
            • There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
            • +
            • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
            • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
            Read more → @@ -395,10 +395,10 @@ COPY 54701

            CGIAR Library Migration

            diff --git a/docs/tags/page/4/index.html b/docs/tags/page/4/index.html index 581514b3e..2a6800130 100644 --- a/docs/tags/page/4/index.html +++ b/docs/tags/page/4/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -96,7 +96,7 @@

            September, 2017

            @@ -106,7 +106,7 @@

          2017-09-07

            -
          • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
          • +
          • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
          Read more → @@ -121,7 +121,7 @@

          August, 2017

          @@ -139,7 +139,7 @@
        • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
        • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
        • -
        • It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
        • +
        • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
        • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
        • We might actually have to block these requests with HTTP 403 depending on the user agent
        • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
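A quick way to verify the X-Robots-Tag behaviour described in the August 2017 entry above is to request one of the dynamic pages and inspect the response headers; the URL below is illustrative.

```
# check whether the header is being sent on a /discover page
$ curl -s -o /dev/null -D - 'https://cgspace.cgiar.org/discover' | grep -i x-robots-tag
# expect something like: X-Robots-Tag: none
```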
        • @@ -160,7 +160,7 @@

          July, 2017

          @@ -171,8 +171,8 @@

          2017-07-04

          • Merge changes for WLE Phase II theme rename (#329)
          • -
          • Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
          • -
          • We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
          • +
          • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
          • +
          • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
          Read more → @@ -187,11 +187,11 @@
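The psql-plus-sed trick mentioned in the July 2017 entry above could look something like the following; this is a rough sketch only — the database, table, columns, and the exact sed expression are my guesses, not the command from the original notes.

```
# expanded output (-x) prints one "column | value" pair per line,
# which sed can then wrap into XML-ish elements
$ psql -x dspacetest -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry' \
    | sed -E 's/^-\[ RECORD [0-9]+ \]-*$/<record>/; s/^([a-z_]+) *\| (.*)$/  <\1>\2<\/\1>/'
```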

          June, 2017

          - 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. + 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg. Read more → @@ -205,11 +205,11 @@

          May, 2017

          - 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. + 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace. Read more → @@ -223,7 +223,7 @@

          April, 2017

      @@ -252,7 +252,7 @@

      March, 2017

      @@ -270,7 +270,7 @@
    • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
    • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
    • Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
    • -
    • Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    • +
    • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
    @@ -288,7 +288,7 @@
         

    February, 2017

    @@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1
    • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
    • -
    • Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
    • +
    • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
    Read more → @@ -322,15 +322,15 @@ DELETE 1

    January, 2017

    2017-01-02

    • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
    • -
    • I tested on DSpace Test as well and it doesn't work there either
    • -
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
    • +
    • I tested on DSpace Test as well and it doesn’t work there either
    • +
    • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
    Read more → @@ -345,7 +345,7 @@ DELETE 1
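For reference on the sharding task mentioned in the January 2017 entry above: the yearly Solr statistics sharding is done by DSpace's stats-util launcher command. A minimal sketch, with the flag spelling from memory, so checking the built-in help first is worthwhile:

```
# show the available statistics utilities and their flags
$ dspace stats-util -h

# shard the statistics core into yearly cores (the task the January 1st cron job runs)
$ dspace stats-util -s
```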

    December, 2016

    @@ -360,8 +360,8 @@ DELETE 1 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607") 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
      -
    • I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
    • -
    • I've raised a ticket with Atmire to ask
    • +
    • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
    • +
    • I’ve raised a ticket with Atmire to ask
    • Another worrying error from dspace.log is:
    Read more → diff --git a/docs/tags/page/5/index.html b/docs/tags/page/5/index.html index b1a9fa521..12cae9196 100644 --- a/docs/tags/page/5/index.html +++ b/docs/tags/page/5/index.html @@ -14,7 +14,7 @@ - + @@ -42,7 +42,7 @@ - + @@ -96,13 +96,13 @@

    November, 2016

    2016-11-01

      -
    • Add dc.type to the output options for Atmire's Listings and Reports module (#286)
    • +
    • Add dc.type to the output options for Atmire’s Listings and Reports module (#286)

    Listings and Reports with output type

    Read more → @@ -118,7 +118,7 @@

    October, 2016

    @@ -131,7 +131,7 @@
  • ORCIDs plus normal authors
  • -
  • I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • +
  • I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
  • 0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     
    @@ -148,14 +148,14 @@

    September, 2016

    2016-09-01

    • Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
    • -
    • Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
    • +
    • Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
    • We had been using DC=ILRI to determine whether a user was ILRI or not
    • It looks like we might be able to use OUs now, instead of DCs:
    @@ -174,7 +174,7 @@

    August, 2016

    @@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5

    July, 2016

    @@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and

    June, 2016

    2016-06-01