## 2020-01-26

- Add "Gender" to controlled vocabulary for CRPs ([#442](https://github.com/ilri/DSpace/pull/442))
- Deploy the changes on CGSpace and run all updates on the server and reboot it
  - I had to restart the `tomcat7` service several times until all Solr statistics cores came up OK
- I spent a few hours writing a script ([create-thumbnails](https://gist.github.com/alanorth/1c7c8b2131a19559e273fbc1e58d6a71)) to compare the default DSpace thumbnails with the improved parameters above, and when comparing them at 600px I don't really notice much difference other than slightly crisper text in the new ones (a rough sketch of the comparison is below)
  - So that was a waste of time, though I think our 300px thumbnails are a bit small now
  - [Another thread on the ImageMagick forum](https://www.imagemagick.org/discourse-server/viewtopic.php?t=14561) mentions that you need to set the density, then read the image, then set the density again:

```
$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
```
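The linked gist is the actual comparison script; purely as an illustrative sketch (not the gist's contents), a side-by-side comparison of the default and tuned ImageMagick parameters could look something like the following. The 600px size and the density/lagrange settings come from the notes above, while the input/output file naming is an assumption:

```
#!/usr/bin/env bash
# Hedged sketch: render "default" and "improved" thumbnails for each PDF
# so the two can be compared visually. File naming is hypothetical.
for pdf in *.pdf; do
    base="${pdf%.pdf}"
    # Default-style thumbnail: first page, ImageMagick defaults (72 DPI)
    convert "${pdf}[0]" -thumbnail x600 -flatten "${base}-default.jpg"
    # Improved parameters: supersample at 288 DPI, reset to 72 DPI,
    # use the sharper lagrange filter, then scale to 600px height
    convert -density 288 "${pdf}[0]" -density 72 -filter lagrange -thumbnail x600 -flatten "${base}-improved.jpg"
done
```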
- One thing worth mentioning was this syntax for extracting bits from JSON in bash using `jq`:

```
$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink'
"/bitstreams/172559/retrieve"
```

## 2020-01-27

- Bizu has been having problems when she logs into CGSpace: she can't see the community list on the front page
  - This last happened for another user in [2016-11]({{< ref "2016-11.md" >}}), and it was related to the Tomcat `maxHttpHeaderSize` being too small because the user was in too many groups
  - I see that it is similar, with this message appearing in the DSpace log just after she logs in:

```
2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
```

- Now this appears to be a Solr limit of some kind ("too many boolean clauses")
  - I changed `maxBooleanClauses` for all Solr cores on DSpace Test from 1024 to 2048 and then she was able to see her communities (see the sketch below)
  - I made a [pull request](https://github.com/ilri/DSpace/pull/443) and merged it to the `5_x-prod` branch and will deploy it on CGSpace later tonight
  - I am curious whether anyone on the dspace-tech mailing list has run into this, so I will try to send a message there when I get a chance
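For reference, `maxBooleanClauses` lives in each Solr core's `solrconfig.xml`. A minimal sketch of applying the change across all cores from the shell, assuming the stock value of 1024 appears literally in each file and that the cores live under `~/dspace/solr` (adjust to the real Solr home); the cores need a Tomcat restart to pick up the change:

```
# check which cores still have the default limit (path is an assumption)
$ grep maxBooleanClauses ~/dspace/solr/*/conf/solrconfig.xml
# bump the limit from 1024 to 2048 in every core
$ sed -i 's|<maxBooleanClauses>1024</maxBooleanClauses>|<maxBooleanClauses>2048</maxBooleanClauses>|' ~/dspace/solr/*/conf/solrconfig.xml
# restart Tomcat so Solr reloads the configuration
$ sudo systemctl restart tomcat7
```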
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
@@ -145,9 +145,9 @@ location ~ /(themes|static|aspects/ReportingSuite) {
try_files $uri @tomcat;
...
@tomcat
/static
, and the more important point is to handle all the static theme assets, so we can just remove static
from the regex for now (who cares if we can't use nginx to send Etags for OAI CSS!)add_header
in a child block it doesn't inherit the others@tomcat
/static
, and the more important point is to handle all the static theme assets, so we can just remove static
from the regex for now (who cares if we can’t use nginx to send Etags for OAI CSS!)add_header
in a child block it doesn’t inherit the othersinclude extra-security.conf;
to the above location block (but research and test first)location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
/about
on CGSpace, as it's blank on my local test server and we might need to add something there/about
on CGSpace, as it’s blank on my local test server and we might need to add something there$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
@@ -173,7 +173,7 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
...
db.maxidle
from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)db.maxidle
from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)db.maxidle
from -1 to 10, reduce db.maxconnections
from 90 to 50, and restart postgres and tomcat7try_files
location block as well as the expires block$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49
$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
39
pgtune
script to tune the postgres settings:# apt-get install pgtune
@@ -155,7 +155,7 @@ shared_buffers = 1920MB
max_connections = 80
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.474
@@ -173,9 +173,9 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle

-- The authorizations for the item are all public READ, and I don't see any errors in dspace.log when browsing that item
-- I filed a ticket on Atmire's issue tracker
-- I also filed a ticket on Atmire's issue tracker for the PostgreSQL stuff
+- The authorizations for the item are all public READ, and I don’t see any errors in dspace.log when browsing that item
+- I filed a ticket on Atmire’s issue tracker
+- I also filed a ticket on Atmire’s issue tracker for the PostgreSQL stuff
2015-12-03
@@ -187,7 +187,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
-Xms3584m -Xmx3584m
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
@@ -210,7 +210,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
28
-- I have reverted all the pgtune tweaks from the other day, as they didn't fix the stability issues, so I'd rather not have them introducing more variables into the equation
+- I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation
- The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid–late November
@@ -219,7 +219,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle

2015-12-07
-- Atmire sent some fixes to DSpace's REST API code that was leaving contexts open (causing the slow performance and database issues)
+- Atmire sent some fixes to DSpace’s REST API code that was leaving contexts open (causing the slow performance and database issues)
- After deploying the fix to CGSpace the REST API is consistently faster:
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
@@ -234,8 +234,8 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
0.497
2015-12-08
-- Switch CGSpace log compression cron jobs from using lzop to xz—the compression isn't as good, but it's much faster and causes less IO/CPU load
-- Since we figured out (and fixed) the cause of the performance issue, I reverted Google Bot's crawl rate to the “Let Google optimize” setting
+- Switch CGSpace log compression cron jobs from using lzop to xz—the compression isn’t as good, but it’s much faster and causes less IO/CPU load
+- Since we figured out (and fixed) the cause of the performance issue, I reverted Google Bot’s crawl rate to the “Let Google optimize” setting
diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html
index e2250c903..7d9a1974c 100644
--- a/docs/2016-01/index.html
+++ b/docs/2016-01/index.html
@@ -25,7 +25,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
-
+
@@ -55,7 +55,7 @@ Update GitHub wiki for documentation of maintenance tasks.
-
+
@@ -103,7 +103,7 @@ Update GitHub wiki for documentation of maintenance tasks.
January, 2016
@@ -124,19 +124,19 @@ Update GitHub wiki for documentation of maintenance tasks.
2016-01-19
-- Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme's primary color (#157)
+- Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme’s primary color (#157)
- Tweak date-based facets to show more values in drill-down ranges (#162)
-- Need to remember to clear the Cocoon cache after deployment or else you don't see the new ranges immediately
+- Need to remember to clear the Cocoon cache after deployment or else you don’t see the new ranges immediately
- Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account
-- Altmetrics’ support for Handles is kinda weak, so they can't associate our items with DOIs until they are tweeted or blogged, etc first.
+- Altmetrics’ support for Handles is kinda weak, so they can’t associate our items with DOIs until they are tweeted or blogged, etc first.
2016-01-21
- Still waiting for my IFTTT recipe to fire, two days later
-- It looks like the Atom feed on CGSpace hasn't changed in two days, but there have definitely been new items
+- It looks like the Atom feed on CGSpace hasn’t changed in two days, but there have definitely been new items
- The RSS feed is nearly as old, but has different old items there
- On a hunch I cleared the Cocoon cache and now the feeds are fresh
-- Looks like there is configuration option related to this,
webui.feed.cache.age
, which defaults to 48 hours, though I'm not sure what relation it has to the Cocoon cache
+- Looks like there is configuration option related to this,
webui.feed.cache.age
, which defaults to 48 hours, though I’m not sure what relation it has to the Cocoon cache
- In any case, we should change this cache to be something more like 6 hours, as we publish new items several times per day.
- Work around a CSS issue with long URLs in the item view (#172)
diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html
index 142e6cead..47cdbd618 100644
--- a/docs/2016-02/index.html
+++ b/docs/2016-02/index.html
@@ -35,7 +35,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)…
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
"/>
-
+
@@ -65,7 +65,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
-
+
@@ -113,7 +113,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
February, 2016
@@ -144,7 +144,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
-- It's 25 items so editing in the web UI is annoying, let's try SQL!
+- It’s 25 items so editing in the web UI is annoying, let’s try SQL!
dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
@@ -157,7 +157,7 @@ DELETE 25
2016-02-07
-- Working on cleaning up Abenet's DAGRIS data with OpenRefine
+- Working on cleaning up Abenet’s DAGRIS data with OpenRefine
- I discovered two really nice functions in OpenRefine:
value.trim()
and value.escape("javascript")
which shows whitespace characters like \r\n
!
- For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV
- I re-import the resulting CSV and run a GREL on the date issued column:
value.replace("\.0", "")
@@ -178,7 +178,7 @@ postgres=# \q
$ vacuumdb dspacetest
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
-- After building and running a
fresh_install
I symlinked the webapps into Tomcat's webapps folder:
+- After building and running a
fresh_install
I symlinked the webapps into Tomcat’s webapps folder:
$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
@@ -199,7 +199,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
2016-02-08
- Finish cleaning up and importing ~400 DAGRIS items into CGSpace
-- Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme's brand colors (#154)
+- Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme’s brand colors (#154)

@@ -207,7 +207,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
- Re-sync DSpace Test with CGSpace
- Help Sisay with OpenRefine
-- Enable HTTPS on DSpace Test using Let's Encrypt:
+- Enable HTTPS on DSpace Test using Let’s Encrypt:
$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
@@ -222,7 +222,7 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
I had to export some CIAT items that were being cleaned up on the test server and I noticed their dc.contributor.author
fields have DSpace 5 authority index UUIDs…
To clean those up in OpenRefine I used this GREL expression: value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")
Getting more and more hangs on DSpace Test, seemingly random but also during CSV import
-Logs don't always show anything right when it fails, but eventually one of these appears:
+Logs don’t always show anything right when it fails, but eventually one of these appears:
org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
@@ -230,7 +230,7 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
-- Right now DSpace Test's Tomcat heap is set to 1536m and we have quite a bit of free RAM:
+- Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:
# free -m
total used free shared buffers cached
@@ -238,7 +238,7 @@ Mem: 3950 3902 48 9 37 1311
-/+ buffers/cache: 2552 1397
Swap: 255 57 198
-- So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
+- So I’ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
2016-02-11
@@ -259,16 +259,16 @@ Processing 64195.pdf
> Creating thumbnail for 64195.pdf
2016-02-12
-- Looking at CIAT's records again, there are some problems with a dozen or so files (out of 1200)
+- Looking at CIAT’s records again, there are some problems with a dozen or so files (out of 1200)
- A few items are using the same exact PDF
- A few items are using HTM or DOC files
-- A few items link to PDFs on IFPRI's e-Library or Research Gate
+- A few items link to PDFs on IFPRI’s e-Library or Research Gate
- A few items have no item
-- Also, I'm not sure if we import these items, will be remove the
dc.identifier.url
field from the records?
+- Also, I’m not sure if we import these items, will be remove the
dc.identifier.url
field from the records?
2016-02-12
-- Looking at CIAT's records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I'm not sure if we can use those
+- Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those
- 265 items have dirty, URL-encoded filenames:
$ ls | grep -c -E "%"
@@ -291,7 +291,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- This turns the URLs into human-readable versions that we can use as proper filenames
- Run web server and system updates on DSpace Test and reboot
-- To merge
dc.identifier.url
and dc.identifier.url[]
, rename the second column so it doesn't have the brackets, like dc.identifier.url2
+- To merge
dc.identifier.url
and dc.identifier.url[]
, rename the second column so it doesn’t have the brackets, like dc.identifier.url2
- Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with “||” in between
- Work on Python script for parsing and downloading PDF records from
dc.identifier.url
- To get filenames from
dc.identifier.url
, create a new column based on this transform: forEach(value.split('||'), v, v.split('/')[-1]).join('||')
@@ -306,8 +306,8 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
2016-02-20
-- Turns out the “bug” in SAFBuilder isn't a bug, it's a feature that allows you to encode extra information like the destintion bundle in the filename
-- Also, it seems DSpace's SAF import tool doesn't like importing filenames that have accents in them:
+- Turns out the “bug” in SAFBuilder isn’t a bug, it’s a feature that allows you to encode extra information like the destintion bundle in the filename
+- Also, it seems DSpace’s SAF import tool doesn’t like importing filenames that have accents in them:
java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
@@ -327,29 +327,29 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
5_x-prod
branch that is currently based on DSpace 5.15_x-prod
branch that is currently based on DSpace 5.1test.pdf__description:Blah
test.pdf__description:Blah
'
or ,
or =
or [
or ]
or (
or )
or _.pdf
or ._
etcvalue.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
index-lucene-update
cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports moduleindex-lucene-update
cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports moduleException in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
dc.type
indexes (#187)dc.type
(a Dublin Core value) with dc.type.output
(a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.outputtype
and output
dc.contributor.corporateauthor
fielddc.language
field isn't really used, but we should delete these valuesdc.language
field isn’t really used, but we should delete these valuesdc.language.iso
has some weird values, like “En” and “English”robots.txt
, and there's a Jira ticket since December, 2015: https://jira.duraspace.org/browse/DS-2962robots.txt
, and there’s a Jira ticket since December, 2015: https://jira.duraspace.org/browse/DS-2962
5_x-prod
branch4.5.0
)Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
dc.type.output
dc.type.*
dc.type.*
checker
log has some errors we should pay attention to:Run start time: 03/06/2016 04:00:22
@@ -143,13 +143,13 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
******************************************************
tomcat7
Unix user, who seems to have a default limit of 1024 files in its shell/etc/default/tomcat7
)/etc/default/tomcat7
)/etc/security/limits.*
so we can do something for the tomcat7 user there# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
@@ -184,8 +184,8 @@ UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51258
dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
dc.type
dc.type.output
to dc.type
and then re-index, if it behaves betterinput-forms.xml
I see we have two sets of ILRI subjects, but one has a few extra subjectsVALUE CHAINS
from existing list (#216)dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
dc.type.output
to dc.type
and re-indexing seems to have fixed the Listings and Reports issue from abovedc.type.*
but the documentation isn't very clear and I couldn't reach Atmire todaydc.type.*
but the documentation isn’t very clear and I couldn’t reach Atmire todaydc.type.output
move on CGSpace anyways, but we should wait as it might affect other external people!catalina.out
:catalina.out
:Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
@@ -334,14 +334,14 @@ javax.ws.rs.WebApplicationException
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
-- They are old ICRAF fields and we haven't used them since 2011 or so
+- They are old ICRAF fields and we haven’t used them since 2011 or so
- Also delete them from the metadata registry
- CGSpace went down again,
dspace.log
had this:
2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
-- I restarted Tomcat and PostgreSQL and now it's back up
+- I restarted Tomcat and PostgreSQL and now it’s back up
- I bet this is the same crash as yesterday, but I only saw the errors in
catalina.out
- Looks to be related to this, from
dspace.log
:
@@ -383,24 +383,24 @@ UPDATE 46075
$ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
21252
-- I found a recent discussion on the DSpace mailing list and I've asked for advice there
-- Looks like this issue was noted and fixed in DSpace 5.5 (we're on 5.1): https://jira.duraspace.org/browse/DS-2936
-- I've sent a message to Atmire asking about compatibility with DSpace 5.5
+- I found a recent discussion on the DSpace mailing list and I’ve asked for advice there
+- Looks like this issue was noted and fixed in DSpace 5.5 (we’re on 5.1): https://jira.duraspace.org/browse/DS-2936
+- I’ve sent a message to Atmire asking about compatibility with DSpace 5.5
2016-04-21
- Fix a bunch of metadata consistency issues with IITA Journal Articles (Peer review, Formally published, messed up DOIs, etc)
-- Atmire responded with DSpace 5.5 compatible versions for their modules, so I'll start testing those in a few weeks
+- Atmire responded with DSpace 5.5 compatible versions for their modules, so I’ll start testing those in a few weeks
2016-04-22
-- Import 95 records into CTA's Agrodok collection
+- Import 95 records into CTA’s Agrodok collection
2016-04-26
- Test embargo during item upload
- Seems to be working but the help text is misleading as to the date format
-- It turns out the
robots.txt
issue we thought we solved last month isn't solved because you can't use wildcards in URL patterns: https://jira.duraspace.org/browse/DS-2962
+- It turns out the
robots.txt
issue we thought we solved last month isn’t solved because you can’t use wildcards in URL patterns: https://jira.duraspace.org/browse/DS-2962
- Write some nginx rules to add
X-Robots-Tag
HTTP headers to the dynamic requests from robots.txt
instead
- A few URLs to test with:
@@ -449,17 +449,17 @@ dspace.log.2016-04-27:7271
- Add Spanish XMLUI strings so those users see “CGSpace” instead of “DSpace” in the user interface (#222)
- Submit patch to upstream DSpace for the misleading help text in the embargo step of the item submission: https://jira.duraspace.org/browse/DS-3172
- Update infrastructure playbooks for nginx 1.10.x (stable) release: https://github.com/ilri/rmg-ansible-public/issues/32
-- Currently running on DSpace Test, we'll give it a few days before we adjust CGSpace
-- CGSpace down, restarted tomcat and it's back up
+- Currently running on DSpace Test, we’ll give it a few days before we adjust CGSpace
+- CGSpace down, restarted tomcat and it’s back up
2016-04-28
-- Problems with stability again. I've blocked access to
/rest
for now to see if the number of errors in the log files drop
+- Problems with stability again. I’ve blocked access to
/rest
for now to see if the number of errors in the log files drop
- Later we could maybe start logging access to
/rest
and perhaps whitelist some IPs…
2016-04-30
-- Logs for today and yesterday have zero references to this REST error, so I'm going to open back up the REST API but log all requests
+- Logs for today and yesterday have zero references to this REST error, so I’m going to open back up the REST API but log all requests
location /rest {
access_log /var/log/nginx/rest.log;
diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html
index ff46e1cda..605242bcd 100644
--- a/docs/2016-05/index.html
+++ b/docs/2016-05/index.html
@@ -31,7 +31,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
-
+
@@ -61,7 +61,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
-
+
@@ -109,7 +109,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
May, 2016
@@ -127,8 +127,8 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
-- For now I'll block just the Ethiopian IP
-- The owner of that application has said that the
NaN
(not a number) is an error in his code and he'll fix it
+- For now I’ll block just the Ethiopian IP
+- The owner of that application has said that the
NaN
(not a number) is an error in his code and he’ll fix it
2016-05-03
@@ -141,7 +141,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
- DSpace Test is down,
catalina.out
has lots of messages about heap space from some time yesterday (!)
- It looks like Sisay was doing some batch imports
- Hmm, also disk space is full
-- I decided to blow away the solr indexes, since they are 50GB and we don't really need all the Atmire stuff there right now
+- I decided to blow away the solr indexes, since they are 50GB and we don’t really need all the Atmire stuff there right now
- I will re-generate the Discovery indexes after re-deploying
- Testing
renew-letsencrypt.sh
script for nginx
@@ -180,7 +180,7 @@ fi
Not sure what dcterms
is…
Looks like these were added in DSpace 4 to allow for future work to make DSpace more flexible
-CGSpace's dc
registry has 96 items, and the default DSpace one has 73.
+CGSpace’s dc
registry has 96 items, and the default DSpace one has 73.
2016-05-11
@@ -201,7 +201,7 @@ fi
Start a test rebase of the 5_x-prod
branch on top of the dspace-5.5
tag
-
-
There were a handful of conflicts that I didn't understand
+There were a handful of conflicts that I didn’t understand
-
After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:
@@ -209,10 +209,10 @@ fi
[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
-- I've sent them a question about it
+- I’ve sent them a question about it
- A user mentioned having problems with uploading a 33 MB PDF
- I told her I would increase the limit temporarily tomorrow morning
-- Turns out she was able to decrease the size of the PDF so we didn't have to do anything
+- Turns out she was able to decrease the size of the PDF so we didn’t have to do anything
2016-05-12
@@ -227,7 +227,7 @@ fi
- Our
dc.place
and dc.srplace.subregion
could both map to cg.coverage.admin-unit
?
- Should we use
dc.contributor.crp
or cg.contributor.crp
for the CRP (ours is dc.crsubject.crpsubject
)?
-- Our
dc.contributor.affiliation
and dc.contributor.corporate
could both map to dc.contributor
and possibly dc.contributor.center
depending on if it's a CG center or not
+- Our
dc.contributor.affiliation
and dc.contributor.corporate
could both map to dc.contributor
and possibly dc.contributor.center
depending on if it’s a CG center or not
dc.title.jtitle
could either map to dc.publisher
or dc.source
depending on how you read things
@@ -243,8 +243,8 @@ fi
- dc.place → cg.place
-dc.place
is our own field, so it's easy to move
-I've removed dc.title.jtitle
from the list for now because there's no use moving it out of DC until we know where it will go (see discussion yesterday)
+dc.place
is our own field, so it’s easy to move
+I’ve removed dc.title.jtitle
from the list for now because there’s no use moving it out of DC until we know where it will go (see discussion yesterday)
2016-05-18
@@ -269,8 +269,8 @@ fi
- We should move PN*, SG*, CBA, IA, and PHASE* values to
cg.identifier.cpwfproject
- The rest, like BMGF and USAID etc, might have to go to either
dc.description.sponsorship
or cg.identifier.fund
(not sure yet)
-- There are also some mistakes in CPWF's things, like “PN 47”
-- This ought to catch all the CPWF values (there don't appear to be and SG* values):
+- There are also some mistakes in CPWF’s things, like “PN 47”
+- This ought to catch all the CPWF values (there don’t appear to be and SG* values):
@@ -282,7 +282,7 @@ fi
value + "__bundle:THUMBNAIL"
-- Also, I fixed some weird characters using OpenRefine's transform with the following GREL:
+- Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:
value.replace(/\u0081/,'')
@@ -295,7 +295,7 @@ fi
- Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine
- LibreOffice excludes empty cells when it exports and all the fields shift over to the left and cause URLs to go to Subjects, etc.
-- Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don't match!
+- Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don’t match!
- I will have to try later
2016-05-30
@@ -310,8 +310,8 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
- But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
-- I'm trying to do a Discovery index before messing with the authority index
-- Looks like we are missing the
index-authority
cron job, so who knows what's up with our authority index
+- I’m trying to do a Discovery index before messing with the authority index
+- Looks like we are missing the
index-authority
cron job, so who knows what’s up with our authority index
- Run system updates on DSpace Test, re-deploy code, and reboot the server
- Clean up and import ~200 CTA records to CGSpace via CSV like:
@@ -324,7 +324,7 @@ $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA
index-authority
script ran over night and was finished in the morning$ time /home/cgspace.cgiar.org/bin/dspace index-authority
diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html
index 25a311554..14784096c 100644
--- a/docs/2016-06/index.html
+++ b/docs/2016-06/index.html
@@ -9,7 +9,7 @@
-
+
@@ -61,7 +61,7 @@ Working on second phase of metadata migration, looks like this will work for mov
-
+
@@ -109,14 +109,14 @@ Working on second phase of metadata migration, looks like this will work for mov
June, 2016
2016-06-01
- Experimenting with IFPRI OAI (we want to harvest their publications)
-- After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
+- After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
- After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
- This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
- You can see the others by using the OAI
ListSets
verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
@@ -132,12 +132,12 @@ UPDATE 14
2016-06-02
- Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with
cg.coverage.admin-unit
-- Seems that the Browse configuration in
dspace.cfg
can't handle the ‘-’ in the field name:
+- Seems that the Browse configuration in
dspace.cfg
can’t handle the ‘-’ in the field name:
webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
- But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error
-- I've sent a message to the DSpace mailing list to ask about the Browse index definition
+- I’ve sent a message to the DSpace mailing list to ask about the Browse index definition
- A user was having problems with submission and from the stacktrace it looks like a Sherpa/Romeo issue
- I found a thread on the mailing list talking about it and there is bug report and a patch: https://jira.duraspace.org/browse/DS-2740
- The patch applies successfully on DSpace 5.1 so I will try it later
@@ -196,7 +196,7 @@ UPDATE 960
webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
-- That would only be for the “Browse by” function… so we'll have to see what effect that has later
+- That would only be for the “Browse by” function… so we’ll have to see what effect that has later
2016-06-04
@@ -225,10 +225,10 @@ UPDATE 960
-Discuss pulling data from IFPRI's ContentDM with Ryan Miller
+Discuss pulling data from IFPRI’s ContentDM with Ryan Miller
-Looks like OAI is kinda obtuse for this, and if we use ContentDM's API we'll be able to access their internal field names (rather than trying to figure out how they stuffed them into various, repeated Dublin Core fields)
+Looks like OAI is kinda obtuse for this, and if we use ContentDM’s API we’ll be able to access their internal field names (rather than trying to figure out how they stuffed them into various, repeated Dublin Core fields)
2016-06-08
@@ -241,13 +241,13 @@ UPDATE 960
atmire.orcid.id
to see if we can change itdspace/config/about.xml
, so now we can update the textdspace/config/about.xml
, so now we can update the textclosed="true"
attribute of controlled vocabularies not working: https://jira.duraspace.org/browse/DS-3238atmire.orcid.id
field doesn't exist in the schema, as it actually comes from the authority cache during XMLUI run timeatmire.orcid.id
field doesn’t exist in the schema, as it actually comes from the authority cache during XMLUI run time# /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
dc.contributor.corporate
with 13 deletions and 121 replacementsdspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html
index bf4dc8a28..68282e4fe 100644
--- a/docs/2016-07/index.html
+++ b/docs/2016-07/index.html
@@ -41,7 +41,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
-
+
@@ -71,7 +71,7 @@ In this case the select query was showing 95 results before the update
-
+
@@ -119,7 +119,7 @@ In this case the select query was showing 95 results before the update
July, 2016
@@ -143,7 +143,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
2016-07-04
-- Seems the database's author authority values mean nothing without the
authority
Solr core from the host where they were created!
+- Seems the database’s author authority values mean nothing without the
authority
Solr core from the host where they were created!
2016-07-05
@@ -212,7 +212,7 @@ $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UT
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
# awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
710 66.249.78.38
@@ -244,7 +244,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants.dc.contributor.author=false
-- After reindexing I don't see any change in Discovery's display of authors, and still have entries like:
+- After reindexing I don’t see any change in Discovery’s display of authors, and still have entries like:
Grace, D. (464)
Grace, D. (62)
@@ -282,7 +282,7 @@ index.authority.ignore-variants=true

-- The DSpace source code mentions the configuration key
discovery.index.authority.ignore-prefered.*
(with prefix of discovery, despite the docs saying otherwise), so I'm trying the following on DSpace Test:
+- The DSpace source code mentions the configuration key
discovery.index.authority.ignore-prefered.*
(with prefix of discovery, despite the docs saying otherwise), so I’m trying the following on DSpace Test:
discovery.index.authority.ignore-prefered.dc.contributor.author=true
discovery.index.authority.ignore-variants=true
@@ -291,7 +291,7 @@ discovery.index.authority.ignore-variants=true
Deploy species, breed, and identifier changes to CGSpace, as well as About page
Run Linode RAM upgrade (8→12GB)
Re-sync DSpace Test with CGSpace
-I noticed that our backup scripts don't send Solr cores to S3 so I amended the script
+I noticed that our backup scripts don’t send Solr cores to S3 so I amended the script
2016-07-31
diff --git a/docs/2016-08/index.html b/docs/2016-08/index.html
index 1affdfaf9..d1b7d029e 100644
--- a/docs/2016-08/index.html
+++ b/docs/2016-08/index.html
@@ -39,7 +39,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
-
+
@@ -69,7 +69,7 @@ $ git rebase -i dspace-5.5
-
+
@@ -117,7 +117,7 @@ $ git rebase -i dspace-5.5
August, 2016
@@ -134,13 +134,13 @@ $ git rebase -i dspace-5.5
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
-- Lots of conflicts that don't make sense (ie, shouldn't conflict!)
+- Lots of conflicts that don’t make sense (ie, shouldn’t conflict!)
- This file in particular conflicts almost 10 times:
dspace/modules/xmlui-mirage2/src/main/webapp/themes/CGIAR/styles/_style.scss
- Checking out a clean branch at 5.5 and cherry-picking our commits works where that file would normally have a conflict
- Seems to be related to merge commits
-git rebase --preserve-merges
doesn't seem to help
+git rebase --preserve-merges
doesn’t seem to help
- Eventually I just turned on git rerere and solved the conflicts and completed the 403 commit rebase
-- The 5.5 code now builds but doesn't run (white page in Tomcat)
+- The 5.5 code now builds but doesn’t run (white page in Tomcat)
2016-08-02
@@ -173,7 +173,7 @@ $ git rebase -i dspace-5.5
- Still troubleshooting Atmire modules on DSpace 5.5
- Vanilla DSpace 5.5 works on Tomcat 7…
- Ooh, and vanilla DSpace 5.5 works on Tomcat 8 with Java 8!
-- Some notes about setting up Tomcat 8, since it's new on this machine…
+- Some notes about setting up Tomcat 8, since it’s new on this machine…
- Install latest Oracle Java 8 JDK
- Create
setenv.sh
in Tomcat 8 libexec/bin
directory:
@@ -193,17 +193,17 @@ $ ln -sv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/res
$ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/solr
2016-08-09
-- More tests of Atmire's 5.5 modules on a clean, working instance of
5_x-prod
+- More tests of Atmire’s 5.5 modules on a clean, working instance of
5_x-prod
- Still fails, though perhaps differently than before (Flyway): https://gist.github.com/alanorth/5d49c45a16efd7c6bc1e6642e66118b2
- More work on Tomcat 8 and Java 8 stuff for Ansible playbooks
2016-08-10
-- Turns out DSpace 5.x isn't ready for Tomcat 8: https://jira.duraspace.org/browse/DS-3092
-- So we'll need to use Tomcat 7 + Java 8 on Ubuntu 16.04
+- Turns out DSpace 5.x isn’t ready for Tomcat 8: https://jira.duraspace.org/browse/DS-3092
+- So we’ll need to use Tomcat 7 + Java 8 on Ubuntu 16.04
- More work on the Ansible stuff for this, allowing Tomcat 7 to use Java 8
- Merge pull request for fixing the type Discovery index to use
dc.type
(#262)
-- Merge pull request for removing “Bitstream” text from item display, as it confuses users and isn't necessary (#263)
+- Merge pull request for removing “Bitstream” text from item display, as it confuses users and isn’t necessary (#263)
2016-08-11
@@ -223,13 +223,13 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
- Troubleshoot Paramiko connection issues with Ansible on ILRI servers: #37
- Turns out we need to add some MACs to our
sshd_config
: hmac-sha2-512,hmac-sha2-256
-- Update DSpace Test's Java to version 8 to start testing this configuration (seeing as Solr recommends it)
+- Update DSpace Test’s Java to version 8 to start testing this configuration (seeing as Solr recommends it)
2016-08-17
-- More work on Let's Encrypt stuff for Ansible roles
+- More work on Let’s Encrypt stuff for Ansible roles
- Yesterday Atmire responded about DSpace 5.5 issues and asked me to try the
dspace database repair
command to fix Flyway issues
-- The
dspace database
command doesn't even run: https://gist.github.com/alanorth/c43c8d89e8df346d32c0ee938be90cd5
+- The
dspace database
command doesn’t even run: https://gist.github.com/alanorth/c43c8d89e8df346d32c0ee938be90cd5
- Oops, it looks like the missing classes causing
dspace database
to fail were coming from the old ~/dspace/config/spring
folder
- After removing the spring folder and running ant install again,
dspace database
works
- I see there are missing and pending Flyway migrations, but running
dspace database repair
and dspace database migrate
does nothing: https://gist.github.com/alanorth/41ed5abf2ff32d8ac9eedd1c3d015d70
@@ -247,7 +247,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
- A few days ago someone on the DSpace mailing list suggested I try
dspace dsrun org.dspace.authority.UpdateAuthorities
to update preferred author names from ORCID
- If you set
auto-update-items=true
in dspace/config/modules/solrauthority.cfg
it is supposed to update records it finds automatically
-- I updated my name format on ORCID and I've been running that script a few times per day since then but nothing has changed
+- I updated my name format on ORCID and I’ve been running that script a few times per day since then but nothing has changed
- Still troubleshooting Atmire modules on DSpace 5.5
- I sent them some new verbose logs: https://gist.github.com/alanorth/700748995649688148ceba89d760253e
@@ -285,7 +285,7 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
| 5.1.2015.12.03 | Atmire MQM migration | 2016-03-21 17:10:42 | Success |
+----------------+----------------------------+---------------------+---------+
org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
sitemap.xmap
, as well as in each of our XMLUI themessitemap.xmap
, as well as in each of our XMLUI themesThemeResourceReader
changes:
dspace-xmlui/src/main/webapp/sitemap.xmap
dspace-xmlui-mirage2/src/main/webapp/sitemap.xmap
5.5-4.1.1-0
)5.5-4.1.1-0
)dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
dspace/config/spring/api/atmire-cua.xml
but it doesn't help:dspace/config/spring/api/atmire-cua.xml
but it doesn’t help:...
Error creating bean with name 'MetadataStorageInfoService'
diff --git a/docs/2016-09/index.html b/docs/2016-09/index.html
index cbdba2e48..d40a76115 100644
--- a/docs/2016-09/index.html
+++ b/docs/2016-09/index.html
@@ -9,7 +9,7 @@
-
+
@@ -61,7 +61,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=or
-
+
@@ -109,14 +109,14 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=or
September, 2016
2016-09-01
- Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
-- Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
+- Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
- We had been using
DC=ILRI
to determine whether a user was ILRI or not
- It looks like we might be able to use OUs now, instead of DCs:
@@ -242,7 +242,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
See: http://www.fileformat.info/info/unicode/char/e1/index.htm
See: http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&s=&uv=0
If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
-We should definitely clean filenames so they don't use characters that are tricky to process in CSV and shell scripts, like: ,
, '
, and "
+We should definitely clean filenames so they don’t use characters that are tricky to process in CSV and shell scripts, like: ,
, '
, and "
value.replace("'","").replace(",","").replace('"','')
@@ -254,7 +254,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
- The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to Unicode equivalence like I saw above
-Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the tomcat7
user, and deleting the bundle, for each collection's items:
+Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the tomcat7
user, and deleting the bundle, for each collection’s items:
$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
@@ -263,7 +263,7 @@ $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
- Erase and rebuild DSpace Test based on latest Ubuntu 16.04, PostgreSQL 9.5, and Java 8 stuff
- Reading about PostgreSQL maintenance and it seems manual vacuuming is only for certain workloads, such as heavy update/write loads
-- I suggest we disable our nightly manual vacuum task, as we're a mostly read workload, and I'd rather stick as close to the documentation as possible since we haven't done any testing/observation of PostgreSQL
+- I suggest we disable our nightly manual vacuum task, as we’re a mostly read workload, and I’d rather stick as close to the documentation as possible since we haven’t done any testing/observation of PostgreSQL
- See: https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html
- CGSpace went down and the error seems to be the same as always (lately):
@@ -295,7 +295,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
-- We haven't seen that in quite a while…
+- We haven’t seen that in quite a while…
- Indeed, in a month of logs it only occurs 15 times:
# grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
@@ -397,17 +397,17 @@ java.util.Map does not have a no-arg default constructor.
JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"
-- So I'm going to bump the heap +512m and remove all the other experimental shit (and update ansible!)
+- So I’m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)
- Increased JVM heap to 4096m on CGSpace (linode01)
2016-09-15
-- Looking at Google Webmaster Tools again, it seems the work I did on URL query parameters and blocking via the
X-Robots-Tag
HTTP header in March, 2016 seem to have had a positive effect on Google's index for CGSpace
+- Looking at Google Webmaster Tools again, it seems the work I did on URL query parameters and blocking via the
X-Robots-Tag
HTTP header in March, 2016 seem to have had a positive effect on Google’s index for CGSpace

2016-09-16
-- CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren't on those lines so I'm not sure if they were yesterday:
+- CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren’t on those lines so I’m not sure if they were yesterday:
dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2
@@ -434,12 +434,12 @@ Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.H
at com.atmire.statistics.SolrLogThread.run(SourceFile:25)
- I bumped the heap space from 4096m to 5120m to see if this is really about heap speace or not.
-- Looking into some of these errors that I've seen this week but haven't noticed before:
+- Looking into some of these errors that I’ve seen this week but haven’t noticed before:
# zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
113
-- I've sent a message to Atmire about the Solr error to see if it's related to their batch update module
+- I’ve sent a message to Atmire about the Solr error to see if it’s related to their batch update module
2016-09-19
@@ -474,7 +474,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2

-- Found a way to improve the configuration of Atmire's Content and Usage Analysis (CUA) module for date fields
+- Found a way to improve the configuration of Atmire’s Content and Usage Analysis (CUA) module for date fields
-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
+content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month)
@@ -500,8 +500,8 @@ $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsor
Merge accession date improvements for CUA module (#275)
Merge addition of accession date to Discovery search filters (#276)
Merge updates to sponsorship controlled vocabulary (#277)
-I've been trying to add a search filter for dc.description
so the IITA people can search for some tags they use there, but for some reason the filter never shows up in Atmire's CUA
-Not sure if it's something like we already have too many filters there (30), or the filter name is reserved, etc…
+I’ve been trying to add a search filter for dc.description
so the IITA people can search for some tags they use there, but for some reason the filter never shows up in Atmire’s CUA
+Not sure if it’s something like we already have too many filters there (30), or the filter name is reserved, etc…
Generate a list of ILRI subjects for Peter and Abenet to look through/fix:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
@@ -509,7 +509,7 @@ $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsor
Regenerate Discovery indexes a few times after playing with discovery.xml
index definitions (syntax, parameters, etc).
Merge changes to boolean logic in Solr search (#274)
Run all sponsorship and affiliation fixes on CGSpace, deploy latest 5_x-prod
branch, and re-index Discovery on CGSpace
-Tested OCSP stapling on DSpace Test's nginx and it works:
+Tested OCSP stapling on DSpace Test’s nginx and it works:
$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
@@ -519,7 +519,7 @@ OCSP Response Data:
...
Cert Status: good
-- I've been monitoring this for almost two years in this GitHub issue: https://github.com/ilri/DSpace/issues/38
+- I’ve been monitoring this for almost two years in this GitHub issue: https://github.com/ilri/DSpace/issues/38
2016-09-27
@@ -552,10 +552,10 @@ UPDATE 101
- Make a placeholder pull request for
discovery.xml
changes (#278), as I still need to test their effect on Atmire content analysis module
- Make a placeholder pull request for Font Awesome changes (#279), which replaces the GitHub image in the footer with an icon, and add style for RSS and @ icons that I will start replacing in community/collection HTML intros
- Had some issues with local test server after messing with Solr too much, had to blow everything away and re-install from CGSpace
-- Going to try to update Sonja Vermeulen's authority to 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0, as that seems to be one of her authorities that has an ORCID
+- Going to try to update Sonja Vermeulen’s authority to 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0, as that seems to be one of her authorities that has an ORCID
- Merge Font Awesome changes (#279)
-- Minor fix to a string in Atmire's CUA module (#280)
-- This seems to be what I'll need to do for Sonja Vermeulen (but with
2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0
instead on the live site):
+- Minor fix to a string in Atmire’s CUA module (#280)
+- This seems to be what I’ll need to do for Sonja Vermeulen (but with
2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0
instead on the live site):
dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
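- Before running blanket updates like these it can help to preview the rows that will match; a sketch with psql (connection options depend on the local setup):

```
$ psql -d dspacetest -c "select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Vermeulen%';"
```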
@@ -576,8 +576,8 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
2016-09-30
-- Deny access to REST API's
find-by-metadata-field
endpoint to protect against an upstream security issue (DS-3250)
-- There is a patch but it is only for 5.5 and doesn't apply cleanly to 5.1
+- Deny access to REST API’s
find-by-metadata-field
endpoint to protect against an upstream security issue (DS-3250)
+- There is a patch but it is only for 5.5 and doesn’t apply cleanly to 5.1
diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html
index abdb4a682..af5da0038 100644
--- a/docs/2016-10/index.html
+++ b/docs/2016-10/index.html
@@ -15,7 +15,7 @@ ORCIDs only
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
" />
@@ -35,11 +35,11 @@ ORCIDs only
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
-
+
@@ -69,7 +69,7 @@ I exported a random item's metadata as CSV, deleted all columns except id an
-
+
@@ -117,7 +117,7 @@ I exported a random item's metadata as CSV, deleted all columns except id an
October, 2016
@@ -130,17 +130,17 @@ I exported a random item's metadata as CSV, deleted all columns except id an
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author
with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author
with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
-- Hmm, with the
dc.contributor.author
column removed, DSpace doesn't detect any changes
+- Hmm, with the
dc.contributor.author
column removed, DSpace doesn’t detect any changes
- With a blank
dc.contributor.author
column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors
-- I added the disclaimer text to the About page, then added a footer link to the disclaimer's ID, but there is a Bootstrap issue that causes the page content to disappear when using in-page anchors: https://github.com/twbs/bootstrap/issues/1768
+- I added the disclaimer text to the About page, then added a footer link to the disclaimer’s ID, but there is a Bootstrap issue that causes the page content to disappear when using in-page anchors: https://github.com/twbs/bootstrap/issues/1768

-- Looks like we'll just have to add the text to the About page (without a link) or add a separate page
+- Looks like we’ll just have to add the text to the About page (without a link) or add a separate page
2016-10-04
@@ -165,12 +165,12 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
2016-10-05
- Work on more infrastructure cleanups for Ansible DSpace role
-- Clean up Let's Encrypt plumbing and submit pull request for rmg-ansible-public (#60)
+- Clean up Let’s Encrypt plumbing and submit pull request for rmg-ansible-public (#60)
2016-10-06
- Nice! DSpace Test (linode02) is now having
java.lang.OutOfMemoryError: Java heap space
errors…
-- Heap space is 2048m, and we have 5GB of RAM being used for OS cache (Solr!) so let's just bump the memory to 3072m
+- Heap space is 2048m, and we have 5GB of RAM being used for OS cache (Solr!) so let’s just bump the memory to 3072m
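- On Ubuntu’s stock tomcat7 package the heap is set via JAVA_OPTS in /etc/default/tomcat7, so the change is something like this (a sketch; the other flags in the file may differ):

```
# /etc/default/tomcat7
JAVA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC"
# then restart: service tomcat7 restart
```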
- Magdalena from CCAFS asked why the colors in the thumbnails for these two items look different, even though they are the same in the PDF itself

@@ -192,12 +192,12 @@ DELETE 11
root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
47
-- Delete 2GB
cron-filter-media.log
file, as it is just a log from a cron job and it doesn't get rotated like normal log files (almost a year now maybe)
+- Delete 2GB
cron-filter-media.log
file, as it is just a log from a cron job and it doesn’t get rotated like normal log files (almost a year now maybe)
2016-10-14
- Run all system updates on DSpace Test and reboot server
-- Looking into some issues with Discovery filters in Atmire's content and usage analysis module after adjusting the filter class
+- Looking into some issues with Discovery filters in Atmire’s content and usage analysis module after adjusting the filter class
- Looks like changing the filters from
configuration.DiscoverySearchFilterFacet
to configuration.DiscoverySearchFilter
breaks them in Atmire CUA module
2016-10-17
@@ -216,7 +216,7 @@ DELETE 11
$ git rebase -i dspace-5.5
- Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme
-- Skip 1e34751b8cf17021f45d4cf2b9a5800c93fb4cb2 in lieu of upstream's 55e623d1c2b8b7b1fa45db6728e172e06bfa8598 (fixes X-Forwarded-For header) because I had made the same fix myself and it's better to use the upstream one
+- Skip 1e34751b8cf17021f45d4cf2b9a5800c93fb4cb2 in lieu of upstream’s 55e623d1c2b8b7b1fa45db6728e172e06bfa8598 (fixes X-Forwarded-For header) because I had made the same fix myself and it’s better to use the upstream one
- I notice this rebase gets rid of GitHub merge commits… which actually might be fine because merges are fucking annoying to deal with when remote people merge without pulling and rebasing their branch first
- Finished up applying the 5.5 sitemap changes to all themes
- Merge the
discovery.xml
cleanups (#278)
@@ -319,7 +319,7 @@ UPDATE 10
dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
UPDATE 36
-- I updated the authority index but nothing seemed to change, so I'll wait and do it again after I update Discovery below
+- I updated the authority index but nothing seemed to change, so I’ll wait and do it again after I update Discovery below
- Skype chat with Tsega about the IFPRI contentdm bridge
- We tested harvesting OAI in an example collection to see how it works
- Talk to Carlos Quiros about CG Core metadata in CGSpace
@@ -332,7 +332,7 @@ UPDATE 36
$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
-- Run a shit ton of author fixes from Peter Ballantyne that we've been cleaning up for two months:
+- Run a shit ton of author fixes from Peter Ballantyne that we’ve been cleaning up for two months:
$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
diff --git a/docs/2016-11/index.html b/docs/2016-11/index.html
index 0be46f790..5e9d4ba17 100644
--- a/docs/2016-11/index.html
+++ b/docs/2016-11/index.html
@@ -8,7 +8,7 @@
@@ -20,10 +20,10 @@ Add dc.type to the output options for Atmire's Listings and Reports module (
-
+
@@ -53,7 +53,7 @@ Add dc.type to the output options for Atmire's Listings and Reports module (
-
+
@@ -101,13 +101,13 @@ Add dc.type to the output options for Atmire's Listings and Reports module (
November, 2016
2016-11-01
-- Add
dc.type
to the output options for Atmire's Listings and Reports module (#286)
+- Add
dc.type
to the output options for Atmire’s Listings and Reports module (#286)

2016-11-02
@@ -147,7 +147,7 @@ java.lang.NullPointerException
2016-11-06
-- After re-deploying and re-indexing I didn't see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take
+- After re-deploying and re-indexing I didn’t see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take
2016-11-07
@@ -155,8 +155,8 @@ java.lang.NullPointerException
$ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
-- I noticed some weird CRPs in the database, and they don't show up in Discovery for some reason, perhaps the
:
-- I'll export these and fix them in batch:
+- I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps the
:
+- I’ll export these and fix them in batch:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
COPY 22
@@ -169,11 +169,11 @@ COPY 22
2016-11-08
-- Atmire's Listings and Reports module seems to be broken on DSpace 5.5
+- Atmire’s Listings and Reports module seems to be broken on DSpace 5.5

-- I've filed a ticket with Atmire
+- I’ve filed a ticket with Atmire
- Thinking about batch updates for ORCIDs and authors
- Playing with SolrClient in Python to query Solr
- All records in the authority core are either
authority_type:orcid
or authority_type:person
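- The same breakdown can be checked without Python by querying the authority core directly (a sketch, assuming Solr listens on port 8081 as elsewhere in these notes):

```
$ curl -s 'http://localhost:8081/solr/authority/select?q=authority_type:orcid&rows=0&wt=json' | jq '.response.numFound'
$ curl -s 'http://localhost:8081/solr/authority/select?q=authority_type:person&rows=0&wt=json' | jq '.response.numFound'
```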
@@ -185,7 +185,7 @@ COPY 22
2016-11-09
- CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the
5_x-prod
branch, and rebooted the server
-- The error was
Timeout waiting for idle object
but I haven't looked into the Tomcat logs to see what happened
+- The error was
Timeout waiting for idle object
but I haven’t looked into the Tomcat logs to see what happened
- Also, I ran the corrections for CRPs from earlier this week
2016-11-10
@@ -214,7 +214,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
34
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
-- The results (55+34=89) don't seem to match those from the database:
+- The results (55+34=89) don’t seem to match those from the database:
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
count
@@ -230,8 +230,8 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
66
- So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85…
-- And the
find-by-metadata-field
endpoint doesn't seem to have a way to get all items with the field, or a wildcard value
-- I'll ask a question on the dspace-tech mailing list
+- And the
find-by-metadata-field
endpoint doesn’t seem to have a way to get all items with the field, or a wildcard value
+- I’ll ask a question on the dspace-tech mailing list
- And speaking of
text_lang
, this is interesting:
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
@@ -274,7 +274,7 @@ UPDATE 420
dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
UPDATE 183726
-- After that restarted Tomcat and PostgreSQL (because I'm superstitious about caches) and now I see the following in REST API query:
+- After that restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following in REST API query:
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
71
@@ -282,12 +282,12 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
0
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
-- Not sure what's going on, but Discovery shows 83 values, and database shows 85, so I'm going to reindex Discovery just in case
+- Not sure what’s going on, but Discovery shows 83 values, and database shows 85, so I’m going to reindex Discovery just in case
2016-11-14
-- I applied Atmire's suggestions to fix Listings and Reports for DSpace 5.5 and now it works
-- There were some issues with the
dspace/modules/jspui/pom.xml
, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire's installation procedure must have changed
+- I applied Atmire’s suggestions to fix Listings and Reports for DSpace 5.5 and now it works
+- There were some issues with the
dspace/modules/jspui/pom.xml
, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire’s installation procedure must have changed
- So there is apparently this Tomcat native way to limit web crawlers to one session: the Crawler Session Manager Valve
- After adding that to
server.xml
, bots matching the pattern in the configuration will all use ONE session, just like normal users:
@@ -327,7 +327,7 @@ X-Cocoon-Version: 2.2.0

-- Seems the default regex doesn't catch Baidu, though:
+- Seems the default regex doesn’t catch Baidu, though:
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
@@ -374,7 +374,7 @@ Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
-- We absolutely don't use those modules, so we shouldn't build them in the first place
+- We absolutely don’t use those modules, so we shouldn’t build them in the first place
2016-11-17
@@ -394,16 +394,16 @@ UPDATE 7
- Had to run it twice to get them all (regexp_replace() in PostgreSQL is not “global” unless the 'g' flag is passed)
- Run the updates on CGSpace as well
- Run through some collections and manually regenerate some PDF thumbnails for items from before 2016 on DSpace Test to compare with CGSpace
-- I'm debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn't as good
+- I’m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn’t as good
- The results were very good, I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:
$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
-- In related news, I'm looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace's media filter has made thumbnails of THEM):
+- In related news, I’m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace’s media filter has made thumbnails of THEM):
dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
-- I'm not sure if there's anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore…
+- I’m not sure if there’s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore…
2016-11-18
@@ -419,17 +419,17 @@ UPDATE 7
2016-11-23
- Upgrade Java from 7 to 8 on CGSpace
-- I had started planning the in-place PostgreSQL 9.3→9.5 upgrade but decided that I will have to
pg_dump
and pg_restore
when I move to the new server soon anyways, so there's no need to upgrade the database right now
+- I had started planning the in-place PostgreSQL 9.3→9.5 upgrade but decided that I will have to
pg_dump
and pg_restore
when I move to the new server soon anyways, so there’s no need to upgrade the database right now
- Chat with Carlos about CGCore and the CGSpace metadata registry
- Dump CGSpace metadata field registry for Carlos: https://gist.github.com/alanorth/8cbd0bb2704d4bbec78025b4742f8e70
- Send some feedback to Carlos on CG Core so they can better understand how DSpace/CGSpace uses metadata
- Notes about PostgreSQL tuning from James: https://paste.fedoraproject.org/488776/14798952/
- Play with Creative Commons stuff in DSpace submission step
-- It seems to work but it doesn't let you choose a version of CC (like 4.0), and we would need to customize the XMLUI item display so it doesn't display the gross CC badges
+- It seems to work but it doesn’t let you choose a version of CC (like 4.0), and we would need to customize the XMLUI item display so it doesn’t display the gross CC badges
2016-11-24
-- Bizuwork was testing DSpace Test on DSpace 5.5 and noticed that the Listings and Reports module seems to be case sensitive, whereas CGSpace's Listings and Reports isn't (ie, a search for “orth, alan” vs “Orth, Alan” returns the same results on CGSpace, but different on DSpace Test)
+- Bizuwork was testing DSpace Test on DSpace 5.5 and noticed that the Listings and Reports module seems to be case sensitive, whereas CGSpace’s Listings and Reports isn’t (ie, a search for “orth, alan” vs “Orth, Alan” returns the same results on CGSpace, but different on DSpace Test)
- I have raised a ticket with Atmire
- Looks like this issue is actually the new Listings and Reports module honoring the Solr search queries more correctly
@@ -449,7 +449,7 @@ UPDATE 7
Need to do updates for ansible infrastructure role defaults, and switch the GitHub branch to the new 5.5 one
-Testing DSpace 5.5 on CGSpace, it seems CUA's export as XLS works for Usage statistics, but not Content statistics
+Testing DSpace 5.5 on CGSpace, it seems CUA’s export as XLS works for Usage statistics, but not Content statistics
I will raise a bug with Atmire
2016-11-28
@@ -481,7 +481,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
2016-11-29
-- Sisay tried deleting and re-creating Goshu's account but he still can't see any communities on the homepage after he logs in
+- Sisay tried deleting and re-creating Goshu’s account but he still can’t see any communities on the homepage after he logs in
- Around the time of his login I see this in the DSpace logs:
2016-11-29 07:56:36,350 INFO org.dspace.authenticate.LDAPAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:no DN found for user g.cherinet@cgiar.org
@@ -510,7 +510,7 @@ org.dspace.discovery.SearchServiceException: Error executing query
- Which, according to some old threads on DSpace Tech, means that the user has a lot of permissions (from groups or on the individual eperson) which increases the Solr query size / query URL
- It might be fixed by increasing the Tomcat
maxHttpHeaderSize
, which is 8192 (or 8KB) by default
-- I've increased the
maxHttpHeaderSize
to 16384 on DSpace Test and the user said he is now able to see the communities on the homepage
+- I’ve increased the
maxHttpHeaderSize
to 16384 on DSpace Test and the user said he is now able to see the communities on the homepage
- I will make the changes on CGSpace soon
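- A quick way to see what a given Tomcat has configured (a sketch; the path is the Ubuntu tomcat7 default, and no output means the 8192 default is in effect):

```
$ grep -o 'maxHttpHeaderSize="[0-9]*"' /etc/tomcat7/server.xml
maxHttpHeaderSize="16384"
```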
- A few users are reporting having issues with their workflows, they get the following message: “You are not allowed to perform this task”
- Might be the same as DS-2920 on the bug tracker
@@ -518,7 +518,7 @@ org.dspace.discovery.SearchServiceException: Error executing query
2016-11-30
- The
maxHttpHeaderSize
fix worked on CGSpace (user is able to see the community list on the homepage)
-- The “take task” cache fix worked on DSpace Test but it's not an official patch, so I'll have to report the bug to DSpace people and try to get advice
+- The “take task” cache fix worked on DSpace Test but it’s not an official patch, so I’ll have to report the bug to DSpace people and try to get advice
- More work on the KM4Dev Journal article
diff --git a/docs/2016-12/index.html b/docs/2016-12/index.html
index 168afb4a8..34b164f2c 100644
--- a/docs/2016-12/index.html
+++ b/docs/2016-12/index.html
@@ -17,8 +17,8 @@ While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-I've raised a ticket with Atmire to ask
+I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+I’ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
" />
@@ -39,11 +39,11 @@ While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-I've raised a ticket with Atmire to ask
+I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+I’ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
-
+
@@ -73,7 +73,7 @@ Another worrying error from dspace.log is:
-
+
@@ -121,7 +121,7 @@ Another worrying error from dspace.log is:
December, 2016
@@ -136,8 +136,8 @@ Another worrying error from dspace.log is:
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-- I've raised a ticket with Atmire to ask
+- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+- I’ve raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
@@ -232,16 +232,16 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
-- Looking through DSpace's solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries
+- Looking through DSpace’s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries
- The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:
2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
-- DSpace's own Solr logs don't give IP addresses, so I will have to enable Nginx's logging of
/solr
so I can see where this request came from
-- I enabled logging of
/rest/
and I think I'll leave it on for good
-- Also, the disk is nearly full because of log file issues, so I'm running some compression on DSpace logs
-- Normally these stay uncompressed for a month just in case we need to look at them, so now I've just compressed anything older than 2 weeks so we can get some disk space back
+- DSpace’s own Solr logs don’t give IP addresses, so I will have to enable Nginx’s logging of
/solr
so I can see where this request came from
+- I enabled logging of
/rest/
and I think I’ll leave it on for good
+- Also, the disk is nearly full because of log file issues, so I’m running some compression on DSpace logs
+- Normally these stay uncompressed for a month just in case we need to look at them, so now I’ve just compressed anything older than 2 weeks so we can get some disk space back
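- A sketch of that kind of one-off compression, assuming the logs live in the site’s log directory and rotate daily as dspace.log.YYYY-MM-DD:

```
$ find /home/cgspace.cgiar.org/log -name 'dspace.log.*' ! -name '*.xz' -mtime +14 -exec xz {} \;
```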
2016-12-04
@@ -266,10 +266,10 @@ Checksum Calculated =
Result = The bitstream could not be found
-----------------------------------------------
-- The first one seems ok, but I don't know what to make of the second one…
+- The first one seems ok, but I don’t know what to make of the second one…
- I had a look and there is indeed no file with the second checksum in the assetstore (ie, looking in
[dspace-dir]/assetstore/99/59/30/...
)
-- For what it's worth, there is no item on DSpace Test or S3 backups with that checksum either…
-- In other news, I'm looking at JVM settings from the Solr 4.10.2 release, from
bin/solr.in.sh
:
+- For what it’s worth, there is no item on DSpace Test or S3 backups with that checksum either…
+- In other news, I’m looking at JVM settings from the Solr 4.10.2 release, from
bin/solr.in.sh
:
# These GC settings have shown to work well for a number of common Solr workloads
GC_TUNE="-XX:-UseSuperWord \
@@ -292,21 +292,21 @@ GC_TUNE="-XX:-UseSuperWord \
-XX:+AggressiveOpts"
- I need to try these because they are recommended by the Solr project itself
-- Also, as always, I need to read Shawn Heisey's wiki page on Solr
+- Also, as always, I need to read Shawn Heisey’s wiki page on Solr
2016-12-05
-- I did some basic benchmarking on a local DSpace before and after the JVM settings above, but there wasn't anything amazingly obvious
+- I did some basic benchmarking on a local DSpace before and after the JVM settings above, but there wasn’t anything amazingly obvious
- I want to make the changes on DSpace Test and monitor the JVM heap graphs for a few days to see if they change the JVM GC patterns or anything (munin graphs)
- Spin up new CGSpace server on Linode
-- I did a few traceroutes from Jordan and Kenya and it seems that Linode's Frankfurt datacenter is a few less hops and perhaps less packet loss than the London one, so I put the new server in Frankfurt
+- I did a few traceroutes from Jordan and Kenya and it seems that Linode’s Frankfurt datacenter is a few less hops and perhaps less packet loss than the London one, so I put the new server in Frankfurt
- Do initial provisioning
- Atmire responded about the MQM warnings in the DSpace logs
- Apparently we need to change the batch edit consumers in
dspace/config/dspace.cfg
:
event.consumer.batchedit.filters = Community|Collection+Create
-- I haven't tested it yet, but I created a pull request: #289
+- I haven’t tested it yet, but I created a pull request: #289
2016-12-06
@@ -333,7 +333,7 @@ UPDATE 561
- Paola from CCAFS mentioned she also has the “take task” bug on CGSpace
- Reading about
shared_buffers
in PostgreSQL configuration (default is 128MB)
- Looks like we have ~5GB of memory used by caches on the test server (after OS and JVM heap!), so we might as well bump up the buffers for Postgres
-- The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn't dedicated (also runs Solr, which can benefit from OS cache) so let's try 1024MB
+- The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn’t dedicated (also runs Solr, which can benefit from OS cache) so let’s try 1024MB
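- For reference, a sketch of checking and changing it (shared_buffers only takes effect after a full PostgreSQL restart, not a reload):

```
$ sudo -u postgres psql -c 'SHOW shared_buffers;'
# set shared_buffers = 1024MB in postgresql.conf (path depends on the distro/version), then:
$ sudo service postgresql restart
```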
- In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):
$ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
@@ -363,9 +363,9 @@ user 1m54.190s
sys 0m22.647s
2016-12-07
-- For what it's worth, after running the same SQL updates on my local test server,
index-authority
runs and completes just fine
+- For what it’s worth, after running the same SQL updates on my local test server,
index-authority
runs and completes just fine
- I will have to test more
-- Anyways, I noticed that some of the authority values I set actually have versions of author names we don't want, ie “Grace, D.”
+- Anyways, I noticed that some of the authority values I set actually have versions of author names we don’t want, ie “Grace, D.”
- For example, do a Solr query for “first_name:Grace” and look at the results
- Querying that ID shows the fields that need to be changed:
@@ -400,7 +400,7 @@ sys 0m22.647s
}
- I think I can just update the
value
, first_name
, and last_name
fields…
-- The update syntax should be something like this, but I'm getting errors from Solr:
+- The update syntax should be something like this, but I’m getting errors from Solr:
$ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
{
@@ -417,7 +417,7 @@ sys 0m22.647s
dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 561
-- Then I'll reindex discovery and authority and see how the authority Solr core looks
+- Then I’ll reindex discovery and authority and see how the authority Solr core looks
- After this, now there are authorities for some of the “Grace, D.” and “Grace, Delia” text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):
$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
@@ -462,7 +462,7 @@ update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confi
update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
2016-12-08
-- Something weird happened and Peter Thorne's names all ended up as “Thorne”, I guess because the original authority had that as its name value:
+- Something weird happened and Peter Thorne’s names all ended up as “Thorne”, I guess because the original authority had that as its name value:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
text_value | authority | confidence
@@ -480,7 +480,7 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
UPDATE 43
-- Apparently we also need to normalize Phil Thornton's names to
Thornton, Philip K.
:
+- Apparently we also need to normalize Phil Thornton’s names to
Thornton, Philip K.
:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
text_value | authority | confidence
@@ -504,13 +504,13 @@ UPDATE 362
- It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)
- Everything looks ok after authority and discovery reindex
-- In other news, I think we should really be using more RAM for PostgreSQL's
shared_buffers
-- The PostgreSQL documentation recommends using 25% of the system's RAM on dedicated systems, but we should use a bit less since we also have a massive JVM heap and also benefit from some RAM being used by the OS cache
+- In other news, I think we should really be using more RAM for PostgreSQL’s
shared_buffers
+- The PostgreSQL documentation recommends using 25% of the system’s RAM on dedicated systems, but we should use a bit less since we also have a massive JVM heap and also benefit from some RAM being used by the OS cache
2016-12-09
- More work on finishing rough draft of KM4Dev article
-- Set PostgreSQL's
shared_buffers
on CGSpace to 10% of system RAM (1200MB)
+- Set PostgreSQL’s
shared_buffers
on CGSpace to 10% of system RAM (1200MB)
- Run the following author corrections on CGSpace:
dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
@@ -520,7 +520,7 @@ dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab76
2016-12-11
-- After enabling a sizable
shared_buffers
for CGSpace's PostgreSQL configuration the number of connections to the database dropped significantly
+- After enabling a sizable
shared_buffers
for CGSpace’s PostgreSQL configuration the number of connections to the database dropped significantly

@@ -563,12 +563,12 @@ UPDATE 35
- Looking at logs, it seems we need to evaluate which logs we keep and for how long
- Basically the only ones we need are
dspace.log
because those are used for legacy statistics (need to keep for 1 month)
-- Other logs will be an issue because they don't have date stamps
-- I will add date stamps to the logs we're storing from the tomcat7 user's cron jobs at least, using:
$(date --iso-8601)
+- Other logs will be an issue because they don’t have date stamps
+- I will add date stamps to the logs we’re storing from the tomcat7 user’s cron jobs at least, using:
$(date --iso-8601)
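- A sketch of what such a dated redirect might look like in the tomcat7 user’s crontab (job, schedule, and paths are placeholders):

```
0 3 * * * /home/cgspace.cgiar.org/bin/dspace filter-media >> /home/cgspace.cgiar.org/log/filter-media-$(date --iso-8601).log 2>&1
```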
- Would probably be better to make custom logrotate files for them in the future
-- Clean up some unneeded log files from 2014 (they weren't large, just don't need them)
+- Clean up some unneeded log files from 2014 (they weren’t large, just don’t need them)
- So basically, new cron jobs for logs should look something like this:
-- Find any file named
*.log*
that isn't dspace.log*
, isn't already zipped, and is older than one day, and zip it:
+- Find any file named
*.log*
that isn’t dspace.log*
, isn’t already zipped, and is older than one day, and zip it:
# find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
@@ -582,7 +582,7 @@ PID 17049: PRIO 0, POLICY B: SCHED_BATCH , NICE 0, AFFINITY 0xf
best-effort: prio 7
- All in all this should free up a few gigs (we were at 9.3GB free when I started)
-- Next thing to look at is whether we need Tomcat's access logs
+- Next thing to look at is whether we need Tomcat’s access logs
- I just looked and it seems that we saved 10GB by zipping these logs
- Some users pointed out issues with the “most popular” stats on a community or collection
- This error appears in the logs when you try to view them:
@@ -645,20 +645,20 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
- Atmire sent a quick fix for the
last-update.txt
file not found error
- After applying pull request #291 on DSpace Test I no longer see the error in the logs after the
UpdateSolrStorageReports
task runs
-- Also, I'm toying with the idea of moving the
tomcat7
user's cron jobs to /etc/cron.d
so we can manage them in Ansible
+- Also, I’m toying with the idea of moving the
tomcat7
user’s cron jobs to /etc/cron.d
so we can manage them in Ansible
- Made a pull request with a template for the cron jobs (#75)
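- The main difference with /etc/cron.d is the extra user field after the schedule; a hypothetical example (file name, schedule, and job are placeholders):

```
# /etc/cron.d/dspace-maintenance
SHELL=/bin/bash
0 4 * * * tomcat7 /home/cgspace.cgiar.org/bin/dspace index-discovery > /dev/null 2>&1
```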
-- Testing SMTP from the new CGSpace server and it's not working, I'll have to tell James
+- Testing SMTP from the new CGSpace server and it’s not working, I’ll have to tell James
2016-12-15
- Start planning for server migration this weekend, letting users know
-- I am trying to figure out what the process is to update the server's IP in the Handle system, and emailing the hdladmin account bounces(!)
-- I will contact Jane Euler directly as I know I've corresponded with her in the past
+- I am trying to figure out what the process is to update the server’s IP in the Handle system, and emailing the hdladmin account bounces(!)
+- I will contact Jane Euler directly as I know I’ve corresponded with her in the past
- She said that I should indeed just re-run the
[dspace]/bin/dspace make-handle-config
command and submit the new sitebndl.zip
file to the CNRI website
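- Presumably that is something like the following, with the handle server directory as the argument (the argument is an assumption; the DSpace docs have the exact invocation):

```
$ [dspace]/bin/dspace make-handle-config [dspace]/handle-server
```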
- Also I was troubleshooting some workflow issues from Bizuwork
- I re-created the same scenario by adding a non-admin account and submitting an item, but I was able to successfully approve and commit it
-- So it turns out it's not a bug, it's just that Peter was added as a reviewer/admin AFTER the items were submitted
-- This is how DSpace works, and I need to ask if there is a way to override someone's submission, as the other reviewer seems to not be paying attention, or has perhaps taken the item from the task pool?
+- So it turns out it’s not a bug, it’s just that Peter was added as a reviewer/admin AFTER the items were submitted
+- This is how DSpace works, and I need to ask if there is a way to override someone’s submission, as the other reviewer seems to not be paying attention, or has perhaps taken the item from the task pool?
- Run a batch edit to add “RANGELANDS” ILRI subject to all items containing the word “RANGELANDS” in their metadata for Peter Ballantyne
@@ -666,9 +666,9 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
2016-12-18
- Add four new CRP subjects for 2017 and sort the input forms alphabetically (#294)
-- Test the SMTP on the new server and it's working
+- Test the SMTP on the new server and it’s working
- Last week, when we asked CGNET to update the DNS records this weekend, they misunderstood and did it immediately
-- We quickly told them to undo it, but I just realized they didn't undo the IPv6 AAAA record!
+- We quickly told them to undo it, but I just realized they didn’t undo the IPv6 AAAA record!
- None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then
- Update some names and authorities in the database:
@@ -680,7 +680,7 @@ dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f368
UPDATE 140
- Generated a new UUID for Ben using
uuidgen | tr [A-Z] [a-z]
as the one in Solr had his ORCID but the name format was incorrect
-- In theory DSpace should be able to check names from ORCID and update the records in the database, but I find that this doesn't work (see Jira bug DS-3302)
+- In theory DSpace should be able to check names from ORCID and update the records in the database, but I find that this doesn’t work (see Jira bug DS-3302)
- I need to run these updates along with the other one for CIAT that I found last week
- Enable OCSP stapling for hosts >= Ubuntu 16.04 in our Ansible playbooks (#76)
- Working for DSpace Test on the second response:
@@ -729,7 +729,7 @@ $ exit
- It took about twenty minutes and afterwards I had to check a few things, like:
-- check and enable systemd timer for let's encrypt
+- check and enable systemd timer for let’s encrypt
- enable root cron jobs
- disable root cron jobs on old server after!
- enable tomcat7 cron jobs
@@ -740,13 +740,13 @@ $ exit
2016-12-22
-- Abenet wanted a CSV of the IITA community, but the web export doesn't include the
dc.date.accessioned
field
+- Abenet wanted a CSV of the IITA community, but the web export doesn’t include the
dc.date.accessioned
field
- I had to export it from the command line using the
-a
flag:
$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
2016-12-28
-- We've been getting two alerts per day about CPU usage on the new server from Linode
+- We’ve been getting two alerts per day about CPU usage on the new server from Linode
- These are caused by the batch jobs for Solr etc that run in the early morning hours
- The Linode default is to alert at 90% CPU usage for two hours, but I see the old server was at 150%, so maybe we just need to adjust it
- Speaking of the old server (linode01), I think we can decommission it now
diff --git a/docs/2017-01/index.html b/docs/2017-01/index.html
index 4df0fff22..42ffa1b6c 100644
--- a/docs/2017-01/index.html
+++ b/docs/2017-01/index.html
@@ -9,8 +9,8 @@
@@ -22,10 +22,10 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
-
+
@@ -55,7 +55,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
-
+
@@ -103,15 +103,15 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
-- I tested on DSpace Test as well and it doesn't work there either
-- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
+- I tested on DSpace Test as well and it doesn’t work there either
+- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
2017-01-04
@@ -186,18 +186,18 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
2017-01-08
-- Put Sisay's
item-view.xsl
code to show mapped collections on CGSpace (#295)
+- Put Sisay’s
item-view.xsl
code to show mapped collections on CGSpace (#295)
2017-01-09
-- A user wrote to tell me that the new display of an item's mappings had a crazy bug for at least one item: https://cgspace.cgiar.org/handle/10568/78596
+- A user wrote to tell me that the new display of an item’s mappings had a crazy bug for at least one item: https://cgspace.cgiar.org/handle/10568/78596
- She said she only mapped it once, but it appears to be mapped 184 times

2017-01-10
-- I tried to clean up the duplicate mappings by exporting the item's metadata to CSV, editing, and re-importing, but DSpace said “no changes were detected”
-- I've asked on the dspace-tech mailing list to see if anyone can help
+- I tried to clean up the duplicate mappings by exporting the item’s metadata to CSV, editing, and re-importing, but DSpace said “no changes were detected”
+- I’ve asked on the dspace-tech mailing list to see if anyone can help
- I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help
- For example, this shows 186 mappings for the item, the first three of which are real:
@@ -226,7 +226,7 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
- See: http://stackoverflow.com/a/36427358/487333
-- I'm actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I've never had this issue before
+- I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before
- Now back to cleaning up some journal titles so we can make the controlled vocabulary:
$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
@@ -237,7 +237,7 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
- The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November
- I will have to go through these and fix some more before making the controlled vocabulary
-- Added 30 more corrections or so, now there are 49 total and I'll have to get the top 500 after applying them
+- Added 30 more corrections or so, now there are 49 total and I’ll have to get the top 500 after applying them
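- Getting the top 500 should just be the usual export, limited; a sketch against dc.source (metadata_field_id 55; the output path is arbitrary):

```
$ psql -d dspace -c "\copy (select text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv"
```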
2017-01-13
@@ -256,12 +256,12 @@ delete from collection2item where id = '91082';
- Helping clean up some file names in the 232 CIAT records that Sisay worked on last week
- There are about 30 files with
%20
(space) and Spanish accents in the file name
- At first I thought we should fix these, but actually it is prescribed by the W3 working group to convert these to UTF8 and URL encode them!
-- And the file names don't really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
+- And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
- Seems like the only ones I should replace are the
'
apostrophe characters, as %27
:
value.replace("'",'%27')
-- Add the item's Type to the filename column as a hint to SAF Builder so it can set a more useful description field:
+- Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:
value + "__description:" + cells["dc.type"].value
@@ -279,18 +279,18 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
2017-01-19
-- In testing a random sample of CIAT's PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are
+- In testing a random sample of CIAT’s PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are
- Import 232 CIAT records into CGSpace:
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
2017-01-22
-- Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel's CSV exporter)
+- Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel’s CSV exporter)
- There were also some issues with an invalid dc.date.issued field, and I trimmed leading / trailing whitespace and cleaned up some URLs with unneeded parameters like ?show=full
2017-01-23
-- I merged Atmire's pull request into the development branch so they can deploy it on DSpace Test
+- I merged Atmire’s pull request into the development branch so they can deploy it on DSpace Test
- Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):
$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
diff --git a/docs/2017-02/index.html b/docs/2017-02/index.html
index 8c9ea486c..e8dc2a1d4 100644
--- a/docs/2017-02/index.html
+++ b/docs/2017-02/index.html
@@ -21,7 +21,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
+Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
" />
@@ -45,9 +45,9 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
+Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
"/>
-
+
@@ -77,7 +77,7 @@ Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
-
+
@@ -125,7 +125,7 @@ Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
February, 2017
@@ -144,7 +144,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
- Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-- Looks like we'll be using
cg.identifier.ccafsprojectpii
as the field name
+- Looks like we’ll be using
cg.identifier.ccafsprojectpii
as the field name
2017-02-08
@@ -156,7 +156,7 @@ DELETE 1
- POLICIES AND INSTITUTIONS → PRIORITIES AND POLICIES FOR CSA
-- The climate risk management one doesn't exist, so I will have to ask Magdalena if they want me to add it to the input forms
+- The climate risk management one doesn’t exist, so I will have to ask Magdalena if they want me to add it to the input forms
- Start testing some nearly 500 author corrections that CCAFS sent me:
$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
@@ -165,7 +165,7 @@ DELETE 1
More work on CCAFS Phase II stuff
Looks like simply adding a new metadata field to dspace/config/registries/cgiar-types.xml
and restarting DSpace causes the field to get added to the registry
It requires a restart but at least it allows you to manage the registry programmatically
-It's not a very good way to manage the registry, though, as removing one there doesn't cause it to be removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created
+It’s not a very good way to manage the registry, though, as removing one there doesn’t cause it to be removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created
Testing some corrections on CCAFS Phase II flagships (cg.subject.ccafs
):
$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
@@ -177,7 +177,7 @@ DELETE 1
2017-02-14
-- Add
SCALING
to ILRI subjects (#304), as Sisay's attempts were all sloppy
+- Add
SCALING
to ILRI subjects (#304), as Sisay’s attempts were all sloppy
- Cherry pick some patches from the DSpace 5.7 branch:
- DS-3363 CSV import error says “row”, means “column”: f7b6c83e991db099003ee4e28ca33d3c7bab48c0
@@ -199,14 +199,14 @@ DELETE 1

- We are using only ~8GB of RAM for applications, and 16GB for caches!
-- The Linode machine we're on has 24GB of RAM but only because that's the only instance that had enough disk space for us (384GB)…
+- The Linode machine we’re on has 24GB of RAM but only because that’s the only instance that had enough disk space for us (384GB)…
- We should probably look into Google Compute Engine or Digital Ocean where we can get more storage without having to follow a linear increase in instance pricing for CPU/memory as well
- Especially because we only use 2 out of 8 CPUs basically:

- Fix issue with duplicate declaration of in atmire-dspace-xmlui
pom.xml
(causing non-fatal warnings during the maven build)
-- Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site's properties file:
+- Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site’s properties file:
handle.canonical.prefix = https://hdl.handle.net/
@@ -270,7 +270,7 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
- Fix label of CCAFS subjects in Atmire Listings and Reports module
- Help Sisay with SQL commands
- Help Paola from CCAFS with the Atmire Listings and Reports module
-- Testing the
fix-metadata-values.py
script on macOS and it seems like we don't need to use .encode('utf-8')
anymore when printing strings to the screen
+- Testing the
fix-metadata-values.py
script on macOS and it seems like we don’t need to use .encode('utf-8')
anymore when printing strings to the screen
- It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string “Entwicklung & Ländlicher Raum” without the
encode()
call, but print it as a bytes object when it is used:
$ python
@@ -285,7 +285,7 @@ b'Entwicklung & L\xc3\xa4ndlicher Raum'
2017-02-21
- Testing regenerating PDF thumbnails, like I started in 2016-11
-- It seems there is a bug in
filter-media
that causes it to process formats that aren't part of its configuration:
+- It seems there is a bug in
filter-media
that causes it to process formats that aren’t part of its configuration:
$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
File: earlywinproposal_esa_postharvest.pdf.jpg
@@ -298,28 +298,28 @@ FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
-- I've sent a message to the mailing list and might file a Jira issue
+- I’ve sent a message to the mailing list and might file a Jira issue
- Ask Atmire about the failed interpolation of the
dspace.internalUrl
variable in atmire-cua.cfg
2017-02-22
- Atmire said I can add
dspace.internalUrl
to my build properties and the error will go away
-- It should be the local URL for accessing Tomcat from the server's own perspective, ie: http://localhost:8080
+- It should be the local URL for accessing Tomcat from the server’s own perspective, ie: http://localhost:8080
2017-02-26
-- Find all fields with “http://hdl.handle.net" values (most are in dc.identifier.uri, but some are in other URL-related fields like cg.link.reference, cg.identifier.dataurl, and cg.identifier.url):
+- Find all fields with “http://hdl.handle.net” values (most are in dc.identifier.uri, but some are in other URL-related fields like cg.link.reference, cg.identifier.dataurl, and cg.identifier.url):
dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
UPDATE 58633
-- This works but I'm thinking I'll wait on the replacement as there are perhaps some other places that rely on http://hdl.handle.net (grep the code, it's scary how many things are hard coded)
+- This works but I’m thinking I’ll wait on the replacement as there are perhaps some other places that rely on http://hdl.handle.net (grep the code, it’s scary how many things are hard coded)
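- For example, a rough count of the hard-coded occurrences could be something like this (the --include globs are only a guess at where these values tend to live):
$ grep -rI --include='*.java' --include='*.xsl' --include='*.cfg' --include='*.xml' 'http://hdl.handle.net' [dspace-source] | wc -l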
- Send message to dspace-tech mailing list with concerns about this
2017-02-27
-- LDAP users cannot log in today, looks to be an issue with CGIAR's LDAP server:
+- LDAP users cannot log in today, looks to be an issue with CGIAR’s LDAP server:
$ openssl s_client -connect svcgroot2.cgiarad.org:3269
CONNECTED(00000003)
@@ -367,7 +367,7 @@ Certificate chain
[dspace]/log/dspace.log.2017-02-26:8
[dspace]/log/dspace.log.2017-02-27:90
-- Also, it seems that we need to use a different user for LDAP binds, as we're still using the temporary one from the root migration, so maybe we can go back to the previous user we were using
+- Also, it seems that we need to use a different user for LDAP binds, as we’re still using the temporary one from the root migration, so maybe we can go back to the previous user we were using
- So it looks like the certificate is invalid AND the bind users we had been using were deleted
- Biruk Debebe recreated the bind user and now we are just waiting for CGNET to update their certificates
- Regarding the filter-media issue I found earlier, it seems that the ImageMagick PDF plugin will also process JPGs if they are in the “Content Files” (aka ORIGINAL) bundle
@@ -383,7 +383,7 @@ Certificate chain
2017-02-28
- After running the CIAT corrections and updating the Discovery and authority indexes, there is still no change in the number of items listed for CIAT in Discovery
-- Ah, this is probably because some items have the International Center for Tropical Agriculture author twice, which I first noticed in 2016-12 but couldn't figure out how to fix
+- Ah, this is probably because some items have the International Center for Tropical Agriculture author twice, which I first noticed in 2016-12 but couldn’t figure out how to fix
- I think I can do it by first exporting all metadatavalues that have the author International Center for Tropical Agriculture
dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
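- One way to list only the items where that author appears more than once might be a simple GROUP BY … HAVING query, for example:
dspace=# select resource_id, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture' group by resource_id having count(*) > 1;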
diff --git a/docs/2017-03/index.html b/docs/2017-03/index.html
index de3f58e88..de67f1959 100644
--- a/docs/2017-03/index.html
+++ b/docs/2017-03/index.html
@@ -20,7 +20,7 @@ Need to send Peter and Michael some notes about this in a few days
Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -46,12 +46,12 @@ Need to send Peter and Michael some notes about this in a few days
Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
@@ -81,7 +81,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
@@ -129,7 +129,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
March, 2017
@@ -147,7 +147,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-- Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+- Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -162,7 +162,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
2017-03-03
- I created a patch for DS-3517 and made a pull request against upstream dspace-5_x: https://github.com/DSpace/DSpace/pull/1669
-- Looks like -colorspace sRGB alone isn't enough, we need to use profiles:
+- Looks like -colorspace sRGB alone isn’t enough, we need to use profiles:
$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
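- A quick way to double check that the resulting thumbnail really ends up as sRGB should be something like:
$ identify -format '%[colorspace]' alc_contrastes_desafios.pdf.jpg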
@@ -180,7 +180,7 @@ DirectClass sRGB Alpha
2017-03-04
- Spent more time looking at the ImageMagick CMYK issue
-- The default_cmyk.icc and default_rgb.icc files are both part of the Ghostscript GPL distribution, but according to DSpace's LICENSES_THIRD_PARTY file, DSpace doesn't allow distribution of dependencies that are licensed solely under the GPL
+- The default_cmyk.icc and default_rgb.icc files are both part of the Ghostscript GPL distribution, but according to DSpace’s LICENSES_THIRD_PARTY file, DSpace doesn’t allow distribution of dependencies that are licensed solely under the GPL
- So this issue is kinda pointless now, as the ICC profiles are absolutely necessary to make a meaningful CMYK→sRGB conversion
2017-03-05
@@ -191,10 +191,10 @@ DirectClass sRGB Alpha
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
-- But there are hundreds of combinations of fields and values (like dc.subject and all the center subjects), and we can't use wildcards in REST!
+- But there are hundreds of combinations of fields and values (like dc.subject and all the center subjects), and we can’t use wildcards in REST!
- Reading about enabling multiple handle prefixes in DSpace
- There is a mailing list thread from 2011 about it: http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html
-- And a comment from Atmire's Bram about it on the DSpace wiki: https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296
+- And a comment from Atmire’s Bram about it on the DSpace wiki: https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296
- Bram mentions an undocumented configuration option handle.plugin.checknameauthority, but I noticed another one in dspace.cfg:
# List any additional prefixes that need to be managed by this handle server
@@ -202,12 +202,12 @@ DirectClass sRGB Alpha
# that repository)
# handle.additional.prefixes = prefix1[, prefix2]
-- Because of this I noticed that our Handle server's config.dct was potentially misconfigured!
+- Because of this I noticed that our Handle server’s config.dct was potentially misconfigured!
- We had some default values still present:
"300:0.NA/YOUR_NAMING_AUTHORITY"
-- I've changed them to the following and restarted the handle server:
+- I’ve changed them to the following and restarted the handle server:
"300:0.NA/10568"
@@ -226,7 +226,7 @@ DirectClass sRGB Alpha
2017-03-06
-- Someone on the mailing list said that handle.plugin.checknameauthority should be false if we're using multiple handle prefixes
+- Someone on the mailing list said that handle.plugin.checknameauthority should be false if we’re using multiple handle prefixes
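- If we ever go down this road the relevant dspace.cfg bits would presumably look something like this (using 10947 as a second prefix is only an example):
handle.canonical.prefix = https://hdl.handle.net/
handle.prefix = 10568
handle.additional.prefixes = 10947
handle.plugin.checknameauthority = false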
2017-03-07
@@ -240,14 +240,14 @@ DirectClass sRGB Alpha
I need to talk to Michael and Peter to share the news, and discuss the structure of their community(s) and try some actual test data
-We'll need to do some data cleaning to make sure they are using the same fields we are, like dc.type and cg.identifier.status
+We’ll need to do some data cleaning to make sure they are using the same fields we are, like dc.type and cg.identifier.status
Another thing is that the import process creates new dc.date.accessioned and dc.date.available fields, so we end up with duplicates (is it important to preserve the originals for these?)
Report DS-3520 issue to Atmire
2017-03-08
-- Merge the author separator changes to 5_x-prod, as everyone has responded positively about it, and it's the default in Mirage2 after all!
-- Cherry pick the commons-collections patch from DSpace's dspace-5_x branch to address DS-3520: https://jira.duraspace.org/browse/DS-3520
+- Merge the author separator changes to 5_x-prod, as everyone has responded positively about it, and it’s the default in Mirage2 after all!
+- Cherry pick the commons-collections patch from DSpace’s dspace-5_x branch to address DS-3520: https://jira.duraspace.org/browse/DS-3520
2017-03-09
@@ -267,7 +267,7 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
- Pull request for controlled vocabulary if Peter approves: https://github.com/ilri/DSpace/pull/308
-- Review Sisay's roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: https://github.com/ilri/DSpace/pull/307
+- Review Sisay’s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: https://github.com/ilri/DSpace/pull/307
- Created an issue to track the progress on the Livestock CRP theme: https://github.com/ilri/DSpace/issues/309
- Created a basic theme for the Livestock CRP community
@@ -306,13 +306,13 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
-- We've been waiting since February to run these
+- We’ve been waiting since February to run these
- Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
- I sent a list to CCAFS people so they can tell me if some should be deleted or moved, etc
-- Test, squash, and merge Sisay's RTB theme into 5_x-prod: https://github.com/ilri/DSpace/pull/316
+- Test, squash, and merge Sisay’s RTB theme into 5_x-prod: https://github.com/ilri/DSpace/pull/316
2017-03-29
diff --git a/docs/2017-04/index.html b/docs/2017-04/index.html
index 121b4e379..64e706862 100644
--- a/docs/2017-04/index.html
+++ b/docs/2017-04/index.html
@@ -37,7 +37,7 @@ Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
"/>
@@ -67,7 +67,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
@@ -115,7 +115,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
April, 2017
@@ -141,14 +141,14 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
$ grep -c profile /tmp/filter-media-cmyk.txt
484
-- Looking at the CG Core document again, I'll send some feedback to Peter and Abenet:
+- Looking at the CG Core document again, I’ll send some feedback to Peter and Abenet:
- We use cg.contributor.crp to indicate the CRP(s) affiliated with the item
-- DSpace has dc.date.available, but this field isn't particularly meaningful other than as an automatic timestamp at the time of item accession (and is identical to dc.date.accessioned)
-- dc.relation exists in CGSpace, but isn't used; rather, dc.relation.ispartofseries is used ~5,000 times for the series name and number within that series
+- DSpace has dc.date.available, but this field isn’t particularly meaningful other than as an automatic timestamp at the time of item accession (and is identical to dc.date.accessioned)
+- dc.relation exists in CGSpace, but isn’t used; rather, dc.relation.ispartofseries is used ~5,000 times for the series name and number within that series
-- Also, I'm noticing some weird outliers in cg.coverage.region, need to remember to go correct these later:
+- Also, I’m noticing some weird outliers in cg.coverage.region, need to remember to go correct these later:
dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
2017-04-04
@@ -159,7 +159,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
1584
- Trying to find a way to get the number of items submitted by a certain user in 2016
-- It's not possible in the DSpace search / module interfaces, but might be able to be derived from dc.description.provenance, as that field contains the name and email of the submitter/approver, ie:
+- It’s not possible in the DSpace search / module interfaces, but might be able to be derived from dc.description.provenance, as that field contains the name and email of the submitter/approver, ie:
Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
No. of bitstreams: 1^M
@@ -169,7 +169,7 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
-- Then this one does the same, but for fields that don't contain checksums (ie, there was no bitstream in the submission):
+- Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
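- To get just the number instead of the full rows, wrapping the same regex in count(*) and matching only the “Submitted” provenance entries (so each item is counted once) should be enough, roughly:
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';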
@@ -188,14 +188,14 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
- After reading the notes for DCAT April 2017 I am testing some new settings for PostgreSQL on DSpace Test:
-db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace's default of 30 is quite low)
+db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace’s default of 30 is quite low)
db.maxwait 5000→10000
db.maxidle 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)
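- For reference, the new values should look something like this in the DSpace configuration:
db.maxconnections = 70
db.maxwait = 10000
db.maxidle = 20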
- I need to look at the Munin graphs after a few days to see if the load has changed
- Run system updates on DSpace Test and reboot the server
-- Discussing harvesting CIFOR's DSpace via OAI
+- Discussing harvesting CIFOR’s DSpace via OAI
- Sisay added their OAI as a source to a new collection, but using the Simple Dublin Core method, so many fields are unqualified and duplicated
- Looking at the documentation it seems that we probably want to be using DSpace Intermediate Metadata
@@ -212,44 +212,44 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
- Handle.net calls this “derived prefixes” and it seems this would work with DSpace if we wanted to go that route
- CIFOR is starting to test aligning their metadata more with CGSpace/CG core
- They shared a test item which is using cg.coverage.country, cg.subject.cifor, dc.subject, and dc.date.issued
-- Looking at their OAI I'm not sure it has updated as I don't see the new fields: https://data.cifor.org/dspace/oai/request?verb=ListRecords&resumptionToken=oai_dc///col_11463_6/900
-- Maybe they need to make sure they are running the OAI cache refresh cron job, or maybe OAI doesn't export these?
-- I added cg.subject.cifor to the metadata registry and I'm waiting for the harvester to re-harvest to see if it picks up more data now
-- Another possibility is that we could use a crosswalk… but I've never done it.
+- Looking at their OAI I’m not sure it has updated as I don’t see the new fields: https://data.cifor.org/dspace/oai/request?verb=ListRecords&resumptionToken=oai_dc///col_11463_6/900
+- Maybe they need to make sure they are running the OAI cache refresh cron job, or maybe OAI doesn’t export these?
+- I added cg.subject.cifor to the metadata registry and I’m waiting for the harvester to re-harvest to see if it picks up more data now
+- Another possibility is that we could use a crosswalk… but I’ve never done it.
2017-04-11
-- Looking at the item from CIFOR it hasn't been updated yet, maybe they aren't running the cron job
-- I emailed Usman from CIFOR to ask if he's running the cron job
+- Looking at the item from CIFOR it hasn’t been updated yet, maybe they aren’t running the cron job
+- I emailed Usman from CIFOR to ask if he’s running the cron job
2017-04-12
- CIFOR says they have cleaned their OAI cache and that the cron job for OAI import is enabled
- Now I see updated fields, like dc.date.issued but none from the CG or CIFOR namespaces
-- Also, DSpace Test hasn't re-harvested this item yet, so I will wait one more day before forcing a re-harvest
-- Looking at CIFOR's OAI using different metadata formats, like qualified Dublin Core and DSpace Intermediate Metadata:
+- Also, DSpace Test hasn’t re-harvested this item yet, so I will wait one more day before forcing a re-harvest
+- Looking at CIFOR’s OAI using different metadata formats, like qualified Dublin Core and DSpace Intermediate Metadata:
-- Looking at one of CGSpace's items in OAI it doesn't seem that metadata fields other than those in the DC schema are exported:
+- Looking at one of CGSpace’s items in OAI it doesn’t seem that metadata fields other than those in the DC schema are exported:
-- Side note: WTF, I just saw an item on CGSpace's OAI that is using dc.cplace.country and dc.rplace.region, which we stopped using in 2016 after the metadata migrations:
+- Side note: WTF, I just saw an item on CGSpace’s OAI that is using dc.cplace.country and dc.rplace.region, which we stopped using in 2016 after the metadata migrations:
-- The particular item is 10568/6 and, for what it's worth, the stale metadata only appears in the OAI view:
+- The particular item is 10568/6 and, for what it’s worth, the stale metadata only appears in the OAI view:
-- I don't see these fields anywhere in our source code or the database's metadata registry, so maybe it's just a cache issue
+- I don’t see these fields anywhere in our source code or the database’s metadata registry, so maybe it’s just a cache issue
- I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace
- Running dspace oai import and dspace oai clean-cache have zero effect, but this seems to rebuild the cache from scratch:
@@ -263,7 +263,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
- After reading some threads on the DSpace mailing list, I see that clean-cache is actually only for caching responses, ie to client requests in the OAI web application
- These are stored in [dspace]/var/oai/requests/
-- The import command should theoretically catch situations like this where an item's metadata was updated, but in this case we changed the metadata schema and it doesn't seem to catch it (could be a bug!)
+- The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)
- Attempting a full rebuild of OAI on CGSpace:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
@@ -284,15 +284,15 @@ sys 1m29.310s
2017-04-13
-- Checking the CIFOR item on DSpace Test, it still doesn't have the new metadata
+- Checking the CIFOR item on DSpace Test, it still doesn’t have the new metadata
- The collection status shows this message from the harvester:
Last Harvest Result: OAI server did not contain any updates on 2017-04-13 02:19:47.964
-- I don't know why there were no updates detected, so I will reset and reimport the collection
-- Usman has set up a custom crosswalk called dimcg that now shows CG and CIFOR metadata namespaces, but we can't use it because DSpace can only harvest DIM by default (from the harvesting user interface)
+- I don’t know why there were no updates detected, so I will reset and reimport the collection
+- Usman has set up a custom crosswalk called dimcg that now shows CG and CIFOR metadata namespaces, but we can’t use it because DSpace can only harvest DIM by default (from the harvesting user interface)
- Also worth noting that the REST interface exposes all fields in the item, including CG and CIFOR fields: https://data.cifor.org/dspace/rest/items/944?expand=metadata
- After re-importing the CIFOR collection it looks very good!
- It seems like they have done a full metadata migration with dc.date.issued and cg.coverage.country etc
@@ -347,10 +347,10 @@ $ rails -s
- Looking at the same item in XMLUI, the countries are not capitalized: https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full
- So it seems he did it in the crosswalk!
- Keep working on Ansible stuff for deploying the CKM REST API
-- We can use systemd's Environment stuff to pass the database parameters to Rails
+- We can use systemd’s Environment stuff to pass the database parameters to Rails
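- The rough idea for the systemd unit is something like this (the paths, credentials, and port are placeholders, and it assumes the app reads DATABASE_URL):
[Service]
Environment=RAILS_ENV=production
Environment=DATABASE_URL=postgres://ckm:secret@localhost:5432/ckm_api
ExecStart=/usr/local/bin/bundle exec rails server -b 127.0.0.1 -p 3000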
- Abenet noticed that the “Workflow Statistics” option is missing now, but we have screenshots from a presentation in 2016 when it was there
- I filed a ticket with Atmire
-- Looking at 933 CIAT records from Sisay, he's having problems creating a SAF bundle to import to DSpace Test
+- Looking at 933 CIAT records from Sisay, he’s having problems creating a SAF bundle to import to DSpace Test
- I started by looking at his CSV in OpenRefine, and I see there are a bunch of fields with whitespace issues that I cleaned up:
value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
@@ -363,8 +363,8 @@ $ rails -s
value.split('/')[-1].replace(/#.*$/,"")
-- The replace part is because some URLs have an anchor like #page=14 which we obviously don't want on the filename
-- Also, we need to only use the PDF on the item corresponding with page 1, so we don't end up with literally hundreds of duplicate PDFs
+- The replace part is because some URLs have an anchor like #page=14 which we obviously don’t want on the filename
+- Also, we need to only use the PDF on the item corresponding with page 1, so we don’t end up with literally hundreds of duplicate PDFs
- Alternatively, I could export each page to a standalone PDF…
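- If we go that route, Ghostscript can extract a page range into a standalone PDF, something like this (the page numbers are only an example):
$ gs -q -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sOutputFile=chapter.pdf book.pdf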
2017-04-20
@@ -399,8 +399,8 @@ $ wc -l /tmp/ciat
Run system updates on CGSpace and reboot server
This includes switching nginx to using upstream with keepalive instead of direct proxy_pass
Re-deploy CGSpace to latest 5_x-prod, including the PABRA and RTB XMLUI themes, as well as the PDF processing and CMYK changes
-More work on Ansible infrastructure stuff for Tsega's CKM DSpace REST API
-I'm going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:
+More work on Ansible infrastructure stuff for Tsega’s CKM DSpace REST API
+I’m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
@@ -471,11 +471,11 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
[dspace]/bin/dspace index-discovery
- Now everything is ok
-- Finally finished manually running the cleanup task over and over and null'ing the conflicting IDs:
+- Finally finished manually running the cleanup task over and over and null’ing the conflicting IDs:
dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
-- Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it's likely we haven't had a cleanup task complete successfully in years…
+- Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it’s likely we haven’t had a cleanup task complete successfully in years…
2017-04-25
@@ -548,7 +548,7 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
2017-04-26
- The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though
-- Update RVM's Ruby from 2.3.0 to 2.4.0 on DSpace Test:
+- Update RVM’s Ruby from 2.3.0 to 2.4.0 on DSpace Test:
$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
$ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html
index b1871605f..2b84b6759 100644
--- a/docs/2017-05/index.html
+++ b/docs/2017-05/index.html
@@ -6,7 +6,7 @@
@@ -14,8 +14,8 @@
@@ -45,7 +45,7 @@
@@ -93,7 +93,7 @@
May, 2017
@@ -109,12 +109,12 @@
2017-05-02
-- Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request
+- Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request
2017-05-04
- Sync DSpace Test with database and assetstore from CGSpace
-- Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server
+- Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server
- Now I can see the workflow statistics and am able to select users, but everything returns 0 items
- Megan says there are still some mapped items that are not appearing since last week, so I forced a full index-discovery -b
- Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.cgiar.org/handle/10568/80731
@@ -149,8 +149,8 @@
- We decided to use AIP export to preserve the hierarchies and handles of communities and collections
- When ingesting some collections I was getting java.lang.OutOfMemoryError: GC overhead limit exceeded, which can be solved by disabling the GC timeout with -XX:-UseGCOverheadLimit
- Other times I was getting an error about heap space, so I kept bumping the RAM allocation by 512MB each time it crashed (up to 4096m!)
-- This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you'll run out of disk space
-- In the end I realized it's better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
+- This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you’ll run out of disk space
+- In the end I realized it’s better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
@@ -162,14 +162,14 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
Give feedback to CIFOR about their data quality:
- Suggestion: uppercase dc.subject, cg.coverage.region, and cg.coverage.subregion in your crosswalk so they match CGSpace and therefore can be faceted / reported on easier
-- Suggestion: use CGSpace's CRP names (cg.contributor.crp), see: dspace/config/input-forms.xml
+- Suggestion: use CGSpace’s CRP names (cg.contributor.crp), see: dspace/config/input-forms.xml
- Suggestion: clean up duplicates and errors in funders, perhaps use a controlled vocabulary like ours, see: dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
- Suggestion: use dc.type “Blog Post” instead of “Blog” for your blog post items (we are also adding a “Blog Post” type to CGSpace soon)
- Question: many of your items use dc.document.uri AND cg.identifier.url with the same text value?
Help Marianne from WLE with an Open Search query to show the latest WLE CRP outputs: https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&sort_by=2&order=DESC
-This uses the webui's item list sort options, see webui.itemlist.sort-option in dspace.cfg
+This uses the webui’s item list sort options, see webui.itemlist.sort-option in dspace.cfg
The equivalent Discovery search would be: https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=WATER%2C+LAND+AND+ECOSYSTEMS&submit_apply_filter=&query=&rpp=10&sort_by=dc.date.issued_dt&order=desc
2017-05-09
@@ -191,7 +191,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
2017-05-10
-- Atmire says they are willing to extend the ORCID implementation, and I've asked them to provide a quote
+- Atmire says they are willing to extend the ORCID implementation, and I’ve asked them to provide a quote
- I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields
- Finally finished importing all the CGIAR Library content, final method was:
@@ -239,7 +239,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
Reboot DSpace Test
-Fix cron jobs for log management on DSpace Test, as they weren't catching dspace.log.* files correctly and we had over six months of them and they were taking up many gigs of disk space
+Fix cron jobs for log management on DSpace Test, as they weren’t catching dspace.log.* files correctly and we had over six months of them and they were taking up many gigs of disk space
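- Something like this should be enough for the cron job (the 30-day retention is just an example):
$ find [dspace]/log -name 'dspace.log.*' -mtime +30 -delete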
2017-05-16
@@ -253,7 +253,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.
-- I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn't helped
+- I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn’t helped
- It appears the item with handle_id 84834 is one of the imported CGIAR Library items:
dspace=# select * from handle where handle_id=84834;
@@ -269,16 +269,16 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
86873 | 10947/99 | 2 | 89153
(1 row)
-- I've posted on the dspace-tech mailing list to see if I can just manually set the handle_seq to that value
+- I’ve posted on the dspace-tech mailing list to see if I can just manually set the handle_seq to that value
- Actually, it seems I can manually set the handle sequence using:
dspace=# select setval('handle_seq',86873);
-- After that I can create collections just fine, though I'm not sure if it has other side effects
+- After that I can create collections just fine, though I’m not sure if it has other side effects
2017-05-21
-- Start creating a basic theme for the CGIAR System Organization's community on CGSpace
+- Start creating a basic theme for the CGIAR System Organization’s community on CGSpace
- Using colors from the CGIAR Branding guidelines (2014)
- Make a GitHub issue to track this work: #324
@@ -315,14 +315,14 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
2017-05-23
- Add Affiliation to filters on Listing and Reports module (#325)
-- Start looking at WLE's Phase II metadata updates but it seems they are not tagging their items properly, as their website importer infers which theme to use based on the name of the CGSpace collection!
-- For now I've suggested that they just change the collection names and that we fix their metadata manually afterwards
+- Start looking at WLE’s Phase II metadata updates but it seems they are not tagging their items properly, as their website importer infers which theme to use based on the name of the CGSpace collection!
+- For now I’ve suggested that they just change the collection names and that we fix their metadata manually afterwards
- Also, they have a lot of messed up values in their cg.subject.wle field so I will clean up some of those first:
dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
COPY 111
-- Respond to Atmire message about ORCIDs, saying that right now we'd prefer to just have them available via REST API like any other metadata field, and that I'm available for a Skype
+- Respond to Atmire message about ORCIDs, saying that right now we’d prefer to just have them available via REST API like any other metadata field, and that I’m available for a Skype
2017-05-26
@@ -334,7 +334,7 @@ COPY 111
- File an issue on GitHub to explore/track migration to proper country/region codes (ISO 2/3 and UN M.49): #326
- Ask Peter how the Landportal.info people should acknowledge us as the source of data on their website
- Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the June, 2017 DCAT meeting
-- Find all of Amos Omore's author name variations so I can link them to his authority entry that has an ORCID:
+- Find all of Amos Omore’s author name variations so I can link them to his authority entry that has an ORCID:
dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
@@ -347,7 +347,7 @@ UPDATE 187
dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
-- But it doesn't look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
+- But it doesn’t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
- Now I should be able to set his name variations to the new authority:
dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
@@ -359,7 +359,7 @@ UPDATE 187
- Discuss WLE themes and subjects with Mia and Macaroni Bros
- We decided we need to create metadata fields for Phase I and II themes
-- I've updated the existing GitHub issue for Phase II (#322) and created a new one to track the changes for Phase I themes (#327)
+- I’ve updated the existing GitHub issue for Phase II (#322) and created a new one to track the changes for Phase I themes (#327)
- After Macaroni Bros update the WLE website importer we will rename the WLE collections to reflect Phase II
- Also, we need to have Mia and Udana look through the existing metadata in cg.subject.wle as it is quite a mess
diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html
index bad2692d0..4380d0109 100644
--- a/docs/2017-06/index.html
+++ b/docs/2017-06/index.html
@@ -6,7 +6,7 @@
@@ -14,8 +14,8 @@
@@ -45,7 +45,7 @@
@@ -93,7 +93,7 @@
June, 2017
@@ -101,7 +101,7 @@
- After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes
- The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes
-- Then we'll create a new sub-community for Phase II and create collections for the research themes there
+- Then we’ll create a new sub-community for Phase II and create collections for the research themes there
- The current “Research Themes” community will be renamed to “WLE Phase I Research Themes”
- Tagged all items in the current Phase I collections with their appropriate themes
- Create pull request to add Phase II research themes to the submission form: #328
@@ -111,15 +111,15 @@
- After adding cg.identifier.wletheme to 1106 WLE items I can see the field on XMLUI but not in REST!
- Strangely it happens on DSpace Test AND on CGSpace!
-- I tried to re-index Discovery but it didn't fix it
+- I tried to re-index Discovery but it didn’t fix it
- Run all system updates on DSpace Test and reboot the server
- After rebooting the server (and therefore restarting Tomcat) the new metadata field is available
-- I've sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket
+- I’ve sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket
2016-06-05
-- Rename WLE's “Research Themes” sub-community to “WLE Phase I Research Themes” on DSpace Test so Macaroni Bros can continue their testing
-- Macaroni Bros tested it and said it's fine, so I renamed it on CGSpace as well
+- Rename WLE’s “Research Themes” sub-community to “WLE Phase I Research Themes” on DSpace Test so Macaroni Bros can continue their testing
+- Macaroni Bros tested it and said it’s fine, so I renamed it on CGSpace as well
- Working on how to automate the extraction of the CIAT Book chapters, doing some magic in OpenRefine to extract page from–to from cg.identifier.url and dc.format.extent, respectively:
- cg.identifier.url:
value.split("page=", "")[1]
@@ -144,7 +144,7 @@
- 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
-- I've flagged them and proceeded without them (752 total) on DSpace Test:
+- I’ve flagged them and proceeded without them (752 total) on DSpace Test:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
@@ -154,9 +154,9 @@
2017-06-07
-- Testing Atmire's patch for the CUA Workflow Statistics again
-- Still doesn't seem to give results I'd expect, like there are no results for Maria Garruccio, or for the ILRI community!
-- Then I'll file an update to the issue on Atmire's tracker
+- Testing Atmire’s patch for the CUA Workflow Statistics again
+- Still doesn’t seem to give results I’d expect, like there are no results for Maria Garruccio, or for the ILRI community!
+- Then I’ll file an update to the issue on Atmire’s tracker
- Created a new branch with just the relevant changes, so I can send it to them
- One thing I noticed is that there is a failed database migration related to CUA:
@@ -194,7 +194,7 @@
2017-06-20
-- Import Abenet and Peter's changes to the CGIAR Library CRP community
+- Import Abenet and Peter’s changes to the CGIAR Library CRP community
- Due to them using Windows and renaming some columns there were formatting, encoding, and duplicate metadata value issues
- I had to remove some fields from the CSV and rename some back to, ie, dc.subject[en_US] just so DSpace would detect changes properly
- Now it looks much better: https://dspacetest.cgiar.org/handle/10947/2517
@@ -212,7 +212,7 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace impo
- WLE has said that one of their Phase II research themes is being renamed from Regenerating Degraded Landscapes to Restoring Degraded Landscapes
- Pull request with the changes to input-forms.xml: #329
-- As of now it doesn't look like there are any items using this research theme so we don't need to do any updates:
+- As of now it doesn’t look like there are any items using this research theme so we don’t need to do any updates:
dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
text_value
@@ -229,15 +229,15 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace impo
Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
- After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load
-- Might be a good time to adjust DSpace's database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
-- I've adjusted the following in CGSpace's config:
+- Might be a good time to adjust DSpace’s database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
+- I’ve adjusted the following in CGSpace’s config:
-db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace's default of 30 is quite low)
+db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace’s default of 30 is quite low)
db.maxwait 5000→10000
db.maxidle 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)
-- We will need to adjust this again (as well as the pg_hba.conf settings) when we deploy Tsega's REST API
+- We will need to adjust this again (as well as the pg_hba.conf settings) when we deploy Tsega’s REST API
- Whip up a test for Marianne of WLE to be able to show both their Phase I and II research themes in the CGSpace item submission form:
diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html
index 752ad42e3..9704ee0b4 100644
--- a/docs/2017-07/index.html
+++ b/docs/2017-07/index.html
@@ -13,8 +13,8 @@ Run system updates and reboot DSpace Test
2017-07-04
Merge changes for WLE Phase II theme rename (#329)
-Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
+Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
" />
@@ -30,10 +30,10 @@ Run system updates and reboot DSpace Test
2017-07-04
Merge changes for WLE Phase II theme rename (#329)
-Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
+Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
"/>
@@ -63,7 +63,7 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o
@@ -111,7 +111,7 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o
July, 2017
@@ -122,19 +122,19 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
-- Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-- We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
+- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+- We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
$ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*): <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
- The sed script is from a post on the PostgreSQL mailing list
-- Abenet says the ILRI board wants to be able to have “lead author” for every item, so I've whipped up a WIP test in the 5_x-lead-author branch
-- It works but is still very rough and we haven't thought out the whole lifecycle yet
+- Abenet says the ILRI board wants to be able to have “lead author” for every item, so I’ve whipped up a WIP test in the 5_x-lead-author branch
+- It works but is still very rough and we haven’t thought out the whole lifecycle yet
- I assume that “lead author” would actually be the first question on the item submission form
-- We also need to check to see which ORCID authority core this uses, because it seems to be using an entirely new one rather than the one for dc.contributor.author (which makes sense of course, but fuck, all the author problems aren't bad enough?!)
+- We also need to check to see which ORCID authority core this uses, because it seems to be using an entirely new one rather than the one for dc.contributor.author (which makes sense of course, but fuck, all the author problems aren’t bad enough?!)
- Also would need to edit XMLUI item displays to incorporate this into authors list
- And fuck, then anyone consuming our data via REST / OAI will not notice that we have an author outside of dc.contributor.authors… ugh
- What if we modify the item submission form to use type-bind fields to show/hide certain fields depending on the type?
@@ -152,8 +152,8 @@ We can use PostgreSQL's extended output format (-x) plus sed to format the o
org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
- Looking at the pg_stat_activity table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense
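- For future reference, a quick way to check the current connection count is:
dspace=# select count(*) from pg_stat_activity;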
-- Tsega restarted Tomcat and it's working now
-- Abenet said she was generating a report with Atmire's CUA module, so it could be due to that?
+- Tsega restarted Tomcat and it’s working now
+- Abenet said she was generating a report with Atmire’s CUA module, so it could be due to that?
- Looking in the logs I see this random error again that I should report to DSpace:
2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
@@ -171,7 +171,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
2017-07-14
-- Sisay sent me a patch to add “Photo Report” to dc.type so I've added it to the 5_x-prod branch
+- Sisay sent me a patch to add “Photo Report” to dc.type so I’ve added it to the 5_x-prod branch
2017-07-17
@@ -193,7 +193,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
- Talk to Tsega and Danny about exporting/ingesting the blog posts from Drupal into DSpace?
- Followup meeting on August 8/9?
-- Sent Abenet the 2415 records from CGIAR Library's Historical Archive (10947/1) after cleaning up the author authorities and HTML entities in dc.contributor.author and dc.description.abstract using OpenRefine:
+ - Sent Abenet the 2415 records from CGIAR Library’s Historical Archive (10947/1) after cleaning up the author authorities and HTML entities in dc.contributor.author and dc.description.abstract using OpenRefine:
- Authors:
value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")
- Abstracts:
replace(value,/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')
@@ -210,10 +210,10 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
2017-07-27
-- Help Sisay with some transforms to add descriptions to the filename column of some CIAT Presentations he's working on in OpenRefine
+- Help Sisay with some transforms to add descriptions to the filename column of some CIAT Presentations he’s working on in OpenRefine
- Marianne emailed a few days ago to ask why “Integrating Ecosystem Solutions” was not in the list of WLE Phase I Research Themes on the input form
- I told her that I only added the themes that I saw in the WLE Phase I Research Themes community
-- Then Mia from WLE also emailed to ask where some WLE focal regions went, and I said I didn't understand what she was talking about, as all we did in our previous work was rename the old “Research Themes” subcommunity to “WLE Phase I Research Themes” and add a new subcommunity for “WLE Phase II Research Themes”.
+- Then Mia from WLE also emailed to ask where some WLE focal regions went, and I said I didn’t understand what she was talking about, as all we did in our previous work was rename the old “Research Themes” subcommunity to “WLE Phase I Research Themes” and add a new subcommunity for “WLE Phase II Research Themes”.
- Discuss some modifications to the CCAFS project tags in CGSpace submission form and in the database
2017-07-28
@@ -228,7 +228,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
2017-07-30
- Start working on CCAFS project tag cleanup
-- More questions about inconsistencies and spelling mistakes in their tags, so I've sent some questions for followup
+- More questions about inconsistencies and spelling mistakes in their tags, so I’ve sent some questions for followup
2017-07-31
diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html
index 6fb0bb62e..536c124b9 100644
--- a/docs/2017-08/index.html
+++ b/docs/2017-08/index.html
@@ -20,7 +20,7 @@ But many of the bots are browsing dynamic URLs like:
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
We might actually have to block these requests with HTTP 403 depending on the user agent
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
@@ -49,7 +49,7 @@ But many of the bots are browsing dynamic URLs like:
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
We might actually have to block these requests with HTTP 403 depending on the user agent
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
@@ -57,7 +57,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
@@ -87,7 +87,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
@@ -135,7 +135,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
August, 2017
@@ -153,7 +153,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
- The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-- It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+- It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
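- A rough sketch of what that could look like in the nginx config (the user agent pattern and backend are just placeholders):
location ~ ^/(discover|browse) {
    if ($http_user_agent ~* "(bot|crawl|spider)") {
        return 403;
    }
    proxy_pass http://localhost:8080;
}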
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
@@ -164,9 +164,9 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
2017-08-02
- Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)
-- I think Atmire's Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can't figure it out
-- I had a look at the module configuration and couldn't figure out a way to do this, so I opened a ticket on the Atmire tracker
-- Atmire responded about the missing workflow statistics issue a few weeks ago but I didn't see it for some reason
+- I think Atmire’s Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can’t figure it out
+- I had a look at the module configuration and couldn’t figure out a way to do this, so I opened a ticket on the Atmire tracker
+- Atmire responded about the missing workflow statistics issue a few weeks ago but I didn’t see it for some reason
- They said they added a publication and saw the workflow stat for the user, so I should try again and let them know
2017-08-05
@@ -176,17 +176,17 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
-- I don't see anything related in our logs, so I asked him to check for our server's IP in their logs
-- Also, in the mean time I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn't reset the collection, just the harvester status!)
+- I don’t see anything related in our logs, so I asked him to check for our server’s IP in their logs
+- Also, in the mean time I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn’t reset the collection, just the harvester status!)
2017-08-07
-- Apply Abenet's corrections for the CGIAR Library's Consortium subcommunity (697 records)
-- I had to fix a few small things, like moving the dc.title column away from the beginning of the row, delete blank spaces in the abstract in vim using :g/^$/d, add the dc.subject[en_US] column back, as she had deleted it and DSpace didn't detect the changes made there (we needed to blank the values instead)
+- Apply Abenet’s corrections for the CGIAR Library’s Consortium subcommunity (697 records)
+- I had to fix a few small things, like moving the dc.title column away from the beginning of the row, delete blank spaces in the abstract in vim using :g/^$/d, add the dc.subject[en_US] column back, as she had deleted it and DSpace didn’t detect the changes made there (we needed to blank the values instead)
2017-08-08
-- Apply Abenet's corrections for the CGIAR Library's historic archive subcommunity (2415 records)
+- Apply Abenet’s corrections for the CGIAR Library’s historic archive subcommunity (2415 records)
- I had to add the dc.subject[en_US] column back with blank values so that DSpace could detect the changes
- I applied the changes in 500 item batches
@@ -196,13 +196,13 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
Help ICARDA upgrade their MELSpace to DSpace 5.7 using the docker-dspace container
- We had to import the PostgreSQL dump to the PostgreSQL container using:
pg_restore -U postgres -d dspace blah.dump
-- Otherwise, when using -O it messes up the permissions on the schema and DSpace can't read it
+- Otherwise, when using -O it messes up the permissions on the schema and DSpace can’t read it
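A minimal sketch of that restore flow without -O, assuming the dspace role does not yet exist inside the container (role, database, and dump names are taken from the note above):

```
# create the owning role and an empty database, then restore with ownership intact
$ createuser -U postgres dspace
$ createdb -U postgres -O dspace dspace
$ pg_restore -U postgres -d dspace blah.dump
```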
2017-08-10
-- Apply last updates to the CGIAR Library's Fund community (812 items)
+- Apply last updates to the CGIAR Library’s Fund community (812 items)
- Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.
- Also I applied the HTML entities unescape transform on the abstract column in Open Refine
- I need to get an author list from the database for only the CGIAR Library community to send to Peter
@@ -243,7 +243,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
85736 70.32.83.92
- The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead
-- I've enabled logging of /oai requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)
+- I’ve enabled logging of /oai requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)
# log oai requests
location /oai {
@@ -268,7 +268,7 @@ DELETE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
- Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done
-- Thinking about resource limits for PostgreSQL again after last week's CGSpace crash and related to a recently discussion I had in the comments of the April, 2017 DCAT meeting notes
+- Thinking about resource limits for PostgreSQL again after last week’s CGSpace crash and related to a recent discussion I had in the comments of the April, 2017 DCAT meeting notes
- In that thread Chris Wilper suggests a new default of 35 max connections for db.maxconnections (from the current default of 30), knowing that each DSpace web application gets to use up to this many on its own
- It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:
@@ -283,21 +283,21 @@ $ grep -rsI SQLException dspace-solr | wc -l
$ grep -rsI SQLException dspace-xmlui | wc -l
866
-- Of those five applications we're running, only solr appears not to use the database directly
-- And JSPUI is only used internally (so it doesn't really count), leaving us with OAI, REST, and XMLUI
-- Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL's default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see superuser_reserved_connections)
-- So we should adjust PostgreSQL's max connections to be DSpace's db.maxconnections * 3 + 3
-- This would allow each application to use up to db.maxconnections and not to go over the system's PostgreSQL limit
+- Of those five applications we’re running, only solr appears not to use the database directly
+- And JSPUI is only used internally (so it doesn’t really count), leaving us with OAI, REST, and XMLUI
+- Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL’s default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see superuser_reserved_connections)
+- So we should adjust PostgreSQL’s max connections to be DSpace’s db.maxconnections * 3 + 3
+- This would allow each application to use up to db.maxconnections and not to go over the system’s PostgreSQL limit
- Perhaps since CGSpace is a busy site with lots of resources we could actually use something like 40 for db.maxconnections
-- Also worth looking into is to set up a database pool using JNDI, as apparently DSpace's db.poolname hasn't been used since around DSpace 1.7 (according to Chris Wilper's comments in the thread)
+- Also worth looking into is to set up a database pool using JNDI, as apparently DSpace’s db.poolname hasn’t been used since around DSpace 1.7 (according to Chris Wilper’s comments in the thread)
- Need to go check the PostgreSQL connection stats in Munin on CGSpace from the past week to get an idea if 40 is appropriate
- Looks like connections hover around 50:

-- Unfortunately I don't have the breakdown of which DSpace apps are making those connections (I'll assume XMLUI)
-- So I guess a limit of 30 (DSpace default) is too low, but 70 causes problems when the load increases and the system's PostgreSQL max_connections is too low
-- For now I think maybe setting DSpace's db.maxconnections to 40 and adjusting the system's max_connections might be a good starting point: 40 * 3 + 3 = 123
+- Unfortunately I don’t have the breakdown of which DSpace apps are making those connections (I’ll assume XMLUI)
+- So I guess a limit of 30 (DSpace default) is too low, but 70 causes problems when the load increases and the system’s PostgreSQL max_connections is too low
+- For now I think maybe setting DSpace’s db.maxconnections to 40 and adjusting the system’s max_connections might be a good starting point: 40 * 3 + 3 = 123
- Apply 223 more author corrections from Peter on CGIAR Library
- Help Magdalena from CCAFS with some CUA statistics questions
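As a rough sketch of the connection arithmetic above (the numbers come from these notes; the psql call is just one way to watch the live count and assumes local access to the dspace database):

```
# webapps * db.maxconnections + superuser_reserved_connections
$ echo $((3 * 40 + 3))
123
# how many connections are currently open
$ psql -d dspace -c 'SELECT count(*) FROM pg_stat_activity;'
```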
@@ -320,7 +320,7 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
- And on others like dc.language.iso, dc.relation.ispartofseries, dc.type, dc.title, etc…
-- Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don't use the Dublin Core one for some reason):
+- Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don’t use the Dublin Core one for some reason):
dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
UPDATE 15
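A follow-up check along the same lines (assuming the same psql access) to confirm nothing is left on the old dc.identifier.url field (237) after the move:

```
$ psql -d dspace -c 'SELECT count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=237;'
```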
@@ -339,8 +339,8 @@ UPDATE 4899
isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
-- This would be true if the authors were like CGIAR System Management Office||CGIAR System Management Office, which some of the CGIAR Library's were
-- Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn't detect any changes, so you have to edit them all manually via DSpace's “Edit Item”
+- This would be true if the authors were like CGIAR System Management Office||CGIAR System Management Office, which some of the CGIAR Library’s were
+- Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn’t detect any changes, so you have to edit them all manually via DSpace’s “Edit Item”
- Ooh! And an even more interesting regex would match any duplicated author:
isNotNull(value.match(/(.+?)\|\|\1/))
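A rough command-line analogue of that GREL, for checking an exported metadata CSV outside of OpenRefine; only a sketch, and it assumes GNU grep with PCRE support (the filename is just an example):

```
# print rows where a value is immediately duplicated, e.g. "X||X"
$ grep -P '(.+?)\|\|\1' /tmp/authors.csv
```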
@@ -354,7 +354,7 @@ UPDATE 4899
2017-08-17
-- Run Peter's edits to the CGIAR System Organization community on DSpace Test
+- Run Peter’s edits to the CGIAR System Organization community on DSpace Test
- Uptime Robot said CGSpace went down for 1 minute, not sure why
- Looking in dspace.log.2017-08-17 I see some weird errors that might be related?
@@ -386,7 +386,7 @@ dspace.log.2017-08-17:584
A few posts on the dspace-tech mailing list say this is related to the Cocoon cache somehow
I will clear the XMLUI cache for now and see if the errors continue (though perhaps shutting down Tomcat and removing the cache is more effective somehow?)
We tested the option for limiting restricted items from the RSS feeds on DSpace Test
-I created four items, and only the two with public metadata showed up in the community's RSS feed:
+ I created four items, and only the two with public metadata showed up in the community’s RSS feed:
- Public metadata, public bitstream ✓
- Public metadata, restricted bitstream ✓
@@ -394,7 +394,7 @@ dspace.log.2017-08-17:584
- Private item ✗
-Peter responded and said that he doesn't want to limit items to be restricted just so we can change the RSS feeds
+Peter responded and said that he doesn’t want to limit items to be restricted just so we can change the RSS feeds
2017-08-18
@@ -403,7 +403,7 @@ dspace.log.2017-08-17:584
- I wired it up to the dc.subject field of the submission interface using the “lookup” type and it works!
- I think we can use this example to get a working AGROVOC query
- More information about authority framework: https://wiki.duraspace.org/display/DSPACE/Authority+Control+of+Metadata+Values
-- Wow, I'm playing with the AGROVOC SPARQL endpoint using the sparql-query tool:
+- Wow, I’m playing with the AGROVOC SPARQL endpoint using the sparql-query tool:
$ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
sparql$ PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
@@ -442,7 +442,7 @@ WHERE {
2017-08-20
-- Since I cleared the XMLUI cache on 2017-08-17 there haven't been any more ERROR net.sf.ehcache.store.DiskStore errors
+- Since I cleared the XMLUI cache on 2017-08-17 there haven’t been any more ERROR net.sf.ehcache.store.DiskStore errors
- Look at the CGIAR Library to see if I can find the items that have been submitted since May:
dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
@@ -474,13 +474,13 @@ WHERE {
2017-08-28
- Bram had written to me two weeks ago to set up a chat about ORCID stuff but the email apparently bounced and I only found out when he emailed me on another account
-- I told him I can chat in a few weeks when I'm back
+- I told him I can chat in a few weeks when I’m back
2017-08-31
- I notice that in many WLE collections Marianne Gadeberg is in the edit or approval steps, but she is also in the groups for those steps.
- I think we need to have a process to go back and check / fix some of these scenarios—to remove her user from the step and instead add her to the group—because we have way too many authorizations and in late 2016 we had performance issues with Solr because of this
-- I asked Sisay about this and hinted that he should go back and fix these things, but let's see what he says
+- I asked Sisay about this and hinted that he should go back and fix these things, but let’s see what he says
- Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:
ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
@@ -488,7 +488,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08
- It seems that I changed the db.maxconnections setting from 70 to 40 around 2017-08-14, but Macaroni Bros also reduced their hourly hammering of the REST API then
-- Nevertheless, it seems like a connection limit is not enough and that I should increase it (as well as the system's PostgreSQL max_connections)
+- Nevertheless, it seems like a connection limit is not enough and that I should increase it (as well as the system’s PostgreSQL max_connections)
diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html
index 315b5b58b..a6d29bbfe 100644
--- a/docs/2017-09/index.html
+++ b/docs/2017-09/index.html
@@ -12,7 +12,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
2017-09-07
-Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
" />
@@ -27,9 +27,9 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
2017-09-07
-Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
"/>
@@ -59,7 +59,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is
@@ -107,7 +107,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is
September, 2017
@@ -117,7 +117,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is
2017-09-07
-- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
2017-09-10
@@ -126,17 +126,17 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is
dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 58
-- I also ran it on DSpace Test because we'll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
+- I also ran it on DSpace Test because we’ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
- Run system updates and restart DSpace Test
- We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)
-- I still have the original data from the CGIAR Library so I've zipped it up and sent it off to linode18 for now
+- I still have the original data from the CGIAR Library so I’ve zipped it up and sent it off to linode18 for now
- sha256sum of original-cgiar-library-6.6GB.tar.gz is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a
- Start doing a test run of the CGIAR Library migration locally
- Notes and todo checklist here for now: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c
- Create pull request for Phase I and II changes to CCAFS Project Tags: #336
-- We've been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
-- There will need to be some metadata updates — though if I recall correctly it is only about seven records — for that as well, I had made some notes about it in 2017-07, but I've asked for more clarification from Lili just in case
-- Looking at the DSpace logs to see if we've had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
+- We’ve been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
+- There will need to be some metadata updates — though if I recall correctly it is only about seven records — for that as well, I had made some notes about it in 2017-07, but I’ve asked for more clarification from Lili just in case
+- Looking at the DSpace logs to see if we’ve had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
# grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
dspace.log.2017-09-01:0
@@ -150,11 +150,11 @@ dspace.log.2017-09-08:10
dspace.log.2017-09-09:0
dspace.log.2017-09-10:0
-- Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I'm sure that helped
+- Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I’m sure that helped
- There are still some errors, though, so maybe I should bump the connection limit up a bit
-- I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we're currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system's PostgreSQL max_connections (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
+- I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we’re currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system’s PostgreSQL max_connections (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
- I updated both CGSpace and DSpace Test to use these new settings (60 connections per web app and 183 for system PostgreSQL limit)
-- I'm expecting to see 0 connection errors for the next few months
+- I’m expecting to see 0 connection errors for the next few months
2017-09-11
@@ -163,7 +163,7 @@ dspace.log.2017-09-10:0
2017-09-12
-- I was testing the METS XSD caching during AIP ingest but it doesn't seem to help actually
+- I was testing the METS XSD caching during AIP ingest but it doesn’t seem to help actually
- The import process takes the same amount of time with and without the caching
- Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):
@@ -182,8 +182,8 @@ dspace.log.2017-09-10:0
I had a Skype call with Bram Luyten from Atmire to discuss various issues related to ORCID in DSpace
- First, ORCID is deprecating their version 1 API (which DSpace uses) and in version 2 API they have removed the ability to search for users by name
-- The logic is that searching by name actually isn't very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names
-- Atmire's proposed integration would work by having users lookup and add authors to the authority core directly using their ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)
+- The logic is that searching by name actually isn’t very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names
+- Atmire’s proposed integration would work by having users lookup and add authors to the authority core directly using their ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)
- Once the association between name and ORCID is made in the authority then it can be autocompleted in the lookup field
- Ideally there could also be a user interface for cleanup and merging of authorities
- He will prepare a quote for us with keeping in mind that this could be useful to contribute back to the community for a 5.x release
@@ -194,8 +194,8 @@ dspace.log.2017-09-10:0
2017-09-13
- Last night Linode sent an alert about CGSpace (linode18) that it has exceeded the outbound traffic rate threshold of 10Mb/s for the last two hours
-- I wonder what was going on, and looking into the nginx logs I think maybe it's OAI…
-- Here is yesterday's top ten IP addresses making requests to /oai:
+- I wonder what was going on, and looking into the nginx logs I think maybe it’s OAI…
+- Here is yesterday’s top ten IP addresses making requests to /oai:
# awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
@@ -208,7 +208,7 @@ dspace.log.2017-09-10:0
15825 35.161.215.53
16704 54.70.51.7
-- Compared to the previous day's logs it looks VERY high:
+- Compared to the previous day’s logs it looks VERY high:
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
@@ -260,7 +260,7 @@ dspace.log.2017-09-10:0
/var/log/nginx/oai.log.8.gz:0
/var/log/nginx/oai.log.9.gz:0
-- Some of these heavy users are also using XMLUI, and their user agent isn't matched by the Tomcat Session Crawler valve, so each request uses a different session
+- Some of these heavy users are also using XMLUI, and their user agent isn’t matched by the Tomcat Session Crawler valve, so each request uses a different session
- Yesterday alone the IP addresses using the API scraper user agent were responsible for 16,000 sessions in XMLUI:
# grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
@@ -273,7 +273,7 @@ dspace.log.2017-09-10:0
WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
- Looking at the spreadsheet with deletions and corrections that CCAFS sent last week
-- It appears they want to delete a lot of metadata, which I'm not sure they realize the implications of:
+- It appears they want to delete a lot of metadata, which I’m not sure they realize the implications of:
dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
text_value | count
@@ -300,12 +300,12 @@ dspace.log.2017-09-10:0
(19 rows)
- I sent CCAFS people an email to ask if they really want to remove these 200+ tags
-- She responded yes, so I'll at least need to do these deletes in PostgreSQL:
+- She responded yes, so I’ll at least need to do these deletes in PostgreSQL:
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
DELETE 207
-- When we discussed this in late July there were some other renames they had requested, but I don't see them in the current spreadsheet so I will have to follow that up
+- When we discussed this in late July there were some other renames they had requested, but I don’t see them in the current spreadsheet so I will have to follow that up
- I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet evolved organically rather than systematically!
- The final list of corrections and deletes should therefore be:
@@ -319,7 +319,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): https://jira.duraspace.org/browse/DS-1492
I commented there suggesting that we disable it globally
I merged the changes to the CCAFS project tags (#336) but still need to finalize the metadata deletions/renames
-I merged the CGIAR Library theme changes (#338) to the 5_x-prod branch in preparation for next week's migration
+I merged the CGIAR Library theme changes (#338) to the 5_x-prod branch in preparation for next week’s migration
I emailed the Handle administrators (hdladmin@cnri.reston.va.us) to ask them what the process is for changing their prefix to be resolved by our resolver
They responded and said that they need email confirmation from the contact of record of the other prefix, so I should have the CGIAR System Organization people email them before I send the new sitebndl.zip
Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database
@@ -354,7 +354,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(9 rows)
-- It created a new authority… let's try to add another item and select the same existing author and see what happens in the database:
+- It created a new authority… let’s try to add another item and select the same existing author and see what happens in the database:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
@@ -387,7 +387,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(10 rows)
-- Shit, it created another authority! Let's try it again!
+- Shit, it created another authority! Let’s try it again!
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
@@ -413,7 +413,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
Michael Marus is the contact for their prefix but he has left CGIAR, but as I actually have access to the CGIAR Library server I think I can just generate a new sitebndl.zip
file from their server and send it to Handle.net
Also, Handle.net says their prefix is up for annual renewal next month so we might want to just pay for it and take it over
CGSpace was very slow and Uptime Robot even said it was down at one time
-I didn't see any abnormally high usage in the REST or OAI logs, but looking at Munin I see the average JVM usage was at 4.9GB and the heap is only 5GB (5120M), so I think it's just normal growing pains
+I didn’t see any abnormally high usage in the REST or OAI logs, but looking at Munin I see the average JVM usage was at 4.9GB and the heap is only 5GB (5120M), so I think it’s just normal growing pains
Every few months I generally try to increase the JVM heap to be 512M higher than the average usage reported by Munin, so now I adjusted it to 5632M
2017-09-15
@@ -480,16 +480,16 @@ DELETE 207
Abenet wants to be able to filter by ISI Journal in advanced search on queries like this: https://cgspace.cgiar.org/discover?filtertype_0=dateIssued&filtertype_1=dateIssued&filter_relational_operator_1=equals&filter_relational_operator_0=equals&filter_1=%5B2010+TO+2017%5D&filter_0=2017&filtertype=type&filter_relational_operator=equals&filter=Journal+Article
I opened an issue to track this (#340) and will test it on DSpace Test soon
Marianne Gadeberg from WLE asked if I would add an account for Adam Hunt on CGSpace and give him permissions to approve all WLE publications
-I told him to register first, as he's a CGIAR user and needs an account to be created before I can add him to the groups
+I told him to register first, as he’s a CGIAR user and needs an account to be created before I can add him to the groups
2017-09-20
- Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite
-- Force thumbnail regeneration for the CGIAR System Organization's Historic Archive community (2000 items):
+- Force thumbnail regeneration for the CGIAR System Organization’s Historic Archive community (2000 items):
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"
-- I'm still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org
+- I’m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org
2017-09-21
@@ -507,29 +507,29 @@ DELETE 207
- Start investigating other platforms for CGSpace due to linear instance pricing on Linode
- We need to figure out how much memory is used by applications, caches, etc, and how much disk space the asset store needs
-- First, here's the last week of memory usage on CGSpace and DSpace Test:
+- First, here’s the last week of memory usage on CGSpace and DSpace Test:

-- 8GB of RAM seems to be good for DSpace Test for now, with Tomcat's JVM heap taking 3GB, caches and buffers taking 3–4GB, and then ~1GB unused
-- 24GB of RAM is way too much for CGSpace, with Tomcat's JVM heap taking 5.5GB and caches and buffers happily using 14GB or so
+- 8GB of RAM seems to be good for DSpace Test for now, with Tomcat’s JVM heap taking 3GB, caches and buffers taking 3–4GB, and then ~1GB unused
+- 24GB of RAM is way too much for CGSpace, with Tomcat’s JVM heap taking 5.5GB and caches and buffers happily using 14GB or so
- As far as disk space, the CGSpace assetstore currently uses 51GB and Solr cores use 86GB (mostly in the statistics core)
-- DSpace Test currently doesn't even have enough space to store a full copy of CGSpace, as its Linode instance only has 96GB of disk space
-- I've heard Google Cloud is nice (cheap and performant) but it's definitely more complicated than Linode and instances aren't that much cheaper to make it worth it
+- DSpace Test currently doesn’t even have enough space to store a full copy of CGSpace, as its Linode instance only has 96GB of disk space
+- I’ve heard Google Cloud is nice (cheap and performant) but it’s definitely more complicated than Linode and instances aren’t that much cheaper to make it worth it
- Here are some theoretical instances on Google Cloud:
- DSpace Test, n1-standard-2 with 2 vCPUs, 7.5GB RAM, 300GB persistent SSD: $99/month
- CGSpace, n1-standard-4 with 4 vCPUs, 15GB RAM, 300GB persistent SSD: $148/month
-- Looking at Linode's instance pricing, for DSpace Test it seems we could use the same 8GB instance for $40/month, and then add block storage of ~300GB for $30 (block storage is currently in beta and priced at $0.10/GiB)
+- Looking at Linode’s instance pricing, for DSpace Test it seems we could use the same 8GB instance for $40/month, and then add block storage of ~300GB for $30 (block storage is currently in beta and priced at $0.10/GiB)
- For CGSpace we could use the cheaper 12GB instance for $80 and then add block storage of 500GB for $50
-- I've sent Peter a message about moving DSpace Test to the New Jersey data center so we can test the block storage beta
+- I’ve sent Peter a message about moving DSpace Test to the New Jersey data center so we can test the block storage beta
- Create pull request for adding ISI Journal to search filters (#341)
- Peter asked if we could map all the items of type Journal Article in ILRI Archive to ILRI articles in journals and newsletters
- It is easy to do via CSV using OpenRefine but I noticed that on CGSpace ~1,000 of the expected 2,500 are already mapped, while on DSpace Test they were not
-- I've asked Peter if he knows what's going on (or who mapped them)
+- I’ve asked Peter if he knows what’s going on (or who mapped them)
- Turns out he had already mapped some, but requested that I finish the rest
- With this GREL in OpenRefine I can find items that are mapped, ie they have 10568/3|| or 10568/3$ in their collection field:
@@ -543,7 +543,7 @@ DELETE 207
- Email Rosemary Kande from ICT to ask about the administrative / finance procedure for moving DSpace Test from EU to US region on Linode
- Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org
-- Peter wants me to clean up the text values for Delia Grace's metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:
+- Peter wants me to clean up the text values for Delia Grace’s metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
text_value | authority | confidence
@@ -554,7 +554,7 @@ DELETE 207
Grace, D. | 6a8ddca3-33c1-45f9-aa00-6fa9fc91e3fc | -1
- Strangely, none of her authority entries have ORCIDs anymore…
-- I'll just fix the text values and forget about it for now:
+- I’ll just fix the text values and forget about it for now:
dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 610
@@ -593,24 +593,24 @@ real 6m6.447s
user 1m34.010s
sys 0m12.113s
-- The index-authority script always seems to fail, I think it's the same old bug
-- Something interesting for my notes about JNDI database pool—since I couldn't determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:
+- The index-authority script always seems to fail, I think it’s the same old bug
+- Something interesting for my notes about JNDI database pool—since I couldn’t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:
ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
...
INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
-- So it's good to know that something gets printed when it fails because I didn't see any mention of JNDI before when I was testing!
+- So it’s good to know that something gets printed when it fails because I didn’t see any mention of JNDI before when I was testing!
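Since the fallback is logged, a simple grep after a restart is enough to tell whether the JNDI pool was actually picked up; a sketch, assuming the DSpace log path used elsewhere in these notes:

```
# if JNDI is working there should be no "Falling back" lines after startup
$ grep -E 'JNDI|Falling back to creating own Database pool' /home/cgspace.cgiar.org/log/dspace.log.2017-09-*
```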
2017-09-26
- Adam Hunt from WLE finally registered so I added him to the editor and approver groups
-- Then I noticed that Sisay never removed Marianne's user accounts from the approver steps in the workflow because she is already in the WLE groups, which are in those steps
-- For what it's worth, I had asked him to remove them on 2017-09-14
+- Then I noticed that Sisay never removed Marianne’s user accounts from the approver steps in the workflow because she is already in the WLE groups, which are in those steps
+- For what it’s worth, I had asked him to remove them on 2017-09-14
- I also went and added the WLE approvers and editors groups to the appropriate steps of all the Phase I and Phase II research theme collections
-- A lot of CIAT's items have manually generated thumbnails which have an incorrect aspect ratio and an ugly black border
-- I communicated with Elizabeth from CIAT to tell her she should use DSpace's automatically generated thumbnails
+- A lot of CIAT’s items have manually generated thumbnails which have an incorrect aspect ratio and an ugly black border
+- I communicated with Elizabeth from CIAT to tell her she should use DSpace’s automatically generated thumbnails
- Start discussing with ICT about Linode server update for DSpace Test
- Rosemary said I need to work with Robert Okal to destroy/create the server, and then let her and Lilian Masigah from finance know the updated Linode asset names for their records
@@ -618,7 +618,7 @@ INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Da
- Tunji from the System Organization finally sent the DNS request for library.cgiar.org to CGNET
- Now the redirects work
-- I quickly registered a Let's Encrypt certificate for the domain:
+- I quickly registered a Let’s Encrypt certificate for the domain:
# systemctl stop nginx
# /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
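Presumably nginx has to be started again once the standalone challenge finishes and the new certificate is referenced in the library.cgiar.org vhost, something like:

```
# systemctl start nginx
```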
diff --git a/docs/2017-10/index.html b/docs/2017-10/index.html
index 5e9be4429..732b802c2 100644
--- a/docs/2017-10/index.html
+++ b/docs/2017-10/index.html
@@ -12,7 +12,7 @@ Peter emailed to point out that many items in the ILRI archive collection have m
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
" />
@@ -28,10 +28,10 @@ Peter emailed to point out that many items in the ILRI archive collection have m
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
@@ -61,7 +61,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
@@ -108,7 +108,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
October, 2017
@@ -119,7 +119,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-- There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+- There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
- Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
2017-10-02
@@ -130,13 +130,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
-- I thought maybe his account had expired (seeing as it's was the first of the month) but he says he was finally able to log in today
+- I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today
- The logs for yesterday show fourteen errors related to LDAP auth failures:
$ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
14
-- For what it's worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET's LDAP server
+- For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server
- Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks
2017-10-04
@@ -147,7 +147,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
-- We'll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn't exist in Discovery yet, but we'll need to add it) as we have moved their poorly curated subjects from dc.subject to cg.subject.system
+- We’ll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn’t exist in Discovery yet, but we’ll need to add it) as we have moved their poorly curated subjects from dc.subject to cg.subject.system
- The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead
- Help Sisay proof sixty-two IITA records on DSpace Test
- Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries
@@ -155,8 +155,8 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
2017-10-05
-- Twice in the past twenty-four hours Linode has warned that CGSpace's outbound traffic rate was exceeding the notification threshold
-- I had a look at yesterday's OAI and REST logs in /var/log/nginx but didn't see anything unusual:
+- Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold
+- I had a look at yesterday’s OAI and REST logs in /var/log/nginx but didn’t see anything unusual:
# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
@@ -183,7 +183,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
- Working on the nginx redirects for CGIAR Library
- We should start using 301 redirects and also allow for /sitemap to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way
-- Remove eleven occurrences of ACP in IITA's cg.coverage.region using the Atmire batch edit module from Discovery
+- Remove eleven occurrences of ACP in IITA’s cg.coverage.region using the Atmire batch edit module from Discovery
- Need to investigate how we can verify the library.cgiar.org using the HTML or DNS methods
- Run corrections on 143 ILRI Archive items that had two dc.identifier.uri values (Handle) that Peter had pointed out earlier this week
- I used OpenRefine to isolate them and then fixed and re-imported them into CGSpace
@@ -197,7 +197,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

-- I'll post it to the Yammer group to see what people think
+- I’ll post it to the Yammer group to see what people think
- I figured out at way to do the HTML verification for Google Search console for library.cgiar.org
- We can drop the HTML file in their XMLUI theme folder and it will get copied to the webapps directory during build/install
- Then we add an nginx alias for that URL in the library.cgiar.org vhost
@@ -213,7 +213,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
-- I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace's console (currently I'm just a user) in order to do that
+- I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace’s console (currently I’m just a user) in order to do that
- Manually clean up some communities and collections that Peter had requested a few weeks ago
- Delete Community 10568/102 (ILRI Research and Development Issues)
- Move five collections to 10568/27629 (ILRI Projects) using move-collections.sh with the following configuration:
@@ -233,8 +233,8 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG

-- We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won't work—we'll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects
-- Also the Google Search Console doesn't work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!
+- We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won’t work—we’ll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects
+- Also the Google Search Console doesn’t work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!
2017-10-12
@@ -245,7 +245,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
- Run system updates on DSpace Test and reboot server
- Merge changes adding a search/browse index for CGIAR System subject to 5_x-prod (#344)
-- I checked the top browse links in Google's search results for site:library.cgiar.org inurl:browse and they are all redirected appropriately by the nginx rewrites I worked on last week
+- I checked the top browse links in Google’s search results for site:library.cgiar.org inurl:browse and they are all redirected appropriately by the nginx rewrites I worked on last week
2017-10-22
@@ -256,12 +256,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
2017-10-26
-- In the last 24 hours we've gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace
+- In the last 24 hours we’ve gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace
- Uptime Robot even noticed CGSpace go “down” for a few minutes
- In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down due to SQL connection pool
- Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up
- Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again
-- Still not sure where the load is coming from right now, but it's clear why there were so many alerts yesterday on the 25th!
+- Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
@@ -274,12 +274,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
7851
- I still have no idea what was causing the load to go up today
-- I finally investigated Magdalena's issue with the item download stats and now I can't reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats
+- I finally investigated Magdalena’s issue with the item download stats and now I can’t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats
- I think it might have been an issue with the statistics not being fresh
- I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten
- Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data
- I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection
-- We've never used it but it could be worth looking at
+- We’ve never used it but it could be worth looking at
2017-10-27
@@ -292,24 +292,24 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
2017-10-29
- Linode alerted about high CPU usage again on CGSpace around 2AM and 4AM
-- I'm still not sure why this started causing alerts so repeatadely the past week
-- I don't see any tell tale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:
+- I’m still not sure why this started causing alerts so repeatedly the past week
+- I don’t see any telltale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:
# grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2049
- So there were 2049 unique sessions during the hour of 2AM
- Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts
-- I think I'll need to enable access logging in nginx to figure out what's going on
-- After enabling logging on requests to XMLUI on / I see some new bot I've never seen before:
+- I think I’ll need to enable access logging in nginx to figure out what’s going on
+- After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:
137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
- CORE seems to be some bot that is “Aggregating the world’s open access research papers”
-- The contact address listed in their bot's user agent is incorrect, correct page is simply: https://core.ac.uk/contact
-- I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Valve
+- The contact address listed in their bot’s user agent is incorrect, correct page is simply: https://core.ac.uk/contact
+- I will check the logs in a few days to see if they are harvesting us regularly, then add their bot’s user agent to the Tomcat Crawler Session Valve
- After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
-- For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace
+- For now I will just contact them to have them update their contact info in the bot’s user agent, but eventually I think I’ll tell them to swap out the CGIAR Library entry for CGSpace
2017-10-30
@@ -333,7 +333,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
137.108.70.6
137.108.70.7
-- I will add their user agent to the Tomcat Session Crawler Valve but it won't help much because they are only using two sessions:
+- I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:
# grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
@@ -346,7 +346,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
24055
-- Just because I'm curious who the top IPs are:
+- Just because I’m curious who the top IPs are:
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
@@ -362,7 +362,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
- At least we know the top two are CORE, but who are the others?
- 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
-- Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their session variable, creating thousands of new sessions!
+- Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!
# grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1419
@@ -372,7 +372,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
From looking at the requests, it appears these are from CIAT and CCAFS
I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them
Actually, according to the Tomcat docs, we could use an IP with crawlerIps: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
-Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn't in Ubuntu 16.04's 7.0.68 build!
+Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!
That would explain the errors I was getting when trying to set it:
WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
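A quick way to confirm which Tomcat build Ubuntu 16.04 actually ships, assuming the distribution tomcat7 package is installed:

```
$ dpkg -s tomcat7 | grep -i '^Version'
```

The notes above suggest this reports a 7.0.68 build, which predates the crawlerIps property.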
@@ -389,14 +389,14 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
2017-10-31
- Very nice, Linode alerted that CGSpace had high CPU usage at 2AM again
-- Ask on the dspace-tech mailing list if it's possible to use an existing item as a template for a new item
+- Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item
- To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:
# grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
-- I've emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
+- I’ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
- Also, I asked if they could perhaps use the sitemap.xml, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets
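- For reference, the endpoints I have in mind are roughly these (a sketch, assuming the standard DSpace 5.x sitemap, OAI-PMH, and REST paths):
$ curl -s -o /dev/null -w '%{http_code}\n' 'https://cgspace.cgiar.org/sitemap'
$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=Identify'
$ curl -s 'https://cgspace.cgiar.org/rest/items?limit=5'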
- I added GoAccess to the list of package to install in the DSpace role of the Ansible infrastructure scripts
- It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:
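- A minimal sketch of that kind of invocation (using the COMBINED log format, the same one I use for the December logs further down):
# goaccess /var/log/nginx/access.log --log-format=COMBINED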
@@ -406,14 +406,14 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
- According to Uptime Robot CGSpace went down and up a few times
- I had a look at goaccess and I saw that CORE was actively indexing
- Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)
-- I'm really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
-- Actually, come to think of it, they aren't even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
+- I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
+- Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
# grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
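- As a quick sanity check of those disallow rules (a sketch, assuming robots.txt is served at the site root):
$ curl -s https://cgspace.cgiar.org/robots.txt | grep -E '^Disallow: /(discover|search-filter)'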
-- I tested a URL of pattern /discover in Google's webmaster tools and it was indeed identified as blocked
+- I tested a URL of pattern /discover in Google’s webmaster tools and it was indeed identified as blocked
- I will send feedback to the CORE bot team
diff --git a/docs/2017-11/index.html b/docs/2017-11/index.html
index 90438795f..00db8fa24 100644
--- a/docs/2017-11/index.html
+++ b/docs/2017-11/index.html
@@ -45,7 +45,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
@@ -75,7 +75,7 @@ COPY 54701
@@ -122,7 +122,7 @@ COPY 54701
November, 2017
@@ -160,15 +160,15 @@ COPY 54701
2017-11-03
- Atmire got back to us to say that they estimate it will take two days of labor to implement the change to Listings and Reports
-- I said I'd ask Abenet if she wants that feature
+- I said I’d ask Abenet if she wants that feature
2017-11-04
-- I finished looking through Sisay's CIAT records for the “Alianzas de Aprendizaje” data
+- I finished looking through Sisay’s CIAT records for the “Alianzas de Aprendizaje” data
- I corrected about half of the authors to standardize them
- Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM
-- It's the first time in a few days that this has happened
-- I had a look to see what was going on, but it isn't the CORE bot:
+- It’s the first time in a few days that this has happened
+- I had a look to see what was going on, but it isn’t the CORE bot:
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
306 68.180.229.31
@@ -193,11 +193,11 @@ COPY 54701
/var/log/nginx/access.log.5.gz:0
/var/log/nginx/access.log.6.gz:0
-- It's clearly a bot as it's making tens of thousands of requests, but it's using a “normal” user agent:
+- It’s clearly a bot as it’s making tens of thousands of requests, but it’s using a “normal” user agent:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
-- For now I don't know what this user is!
+- For now I don’t know what this user is!
2017-11-05
@@ -222,8 +222,8 @@ COPY 54701
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 500
(8 rows)
-- So I'm not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval
-- Looking at monitoring Tomcat's JVM heap with Prometheus, it looks like we need to use JMX + jmx_exporter
+- So I’m not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval
+- Looking at monitoring Tomcat’s JVM heap with Prometheus, it looks like we need to use JMX + jmx_exporter
- This guide shows how to enable JMX in Tomcat by modifying CATALINA_OPTS
- I was able to successfully connect to my local Tomcat with jconsole!
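- For reference, the JMX listener options I mean are roughly these (a sketch of the standard JVM flags appended to CATALINA_OPTS; port 9000 matches the remote jconsole example further down, and disabling auth/SSL is only sane behind a firewall or SSH tunnel):
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"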
@@ -268,8 +268,8 @@ $ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{3
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
7051
-- The worst thing is that this user never specifies a user agent string so we can't lump it in with the other bots using the Tomcat Session Crawler Manager Valve
-- They don't request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with //handle, note the regex below):
+- The worst thing is that this user never specifies a user agent string so we can’t lump it in with the other bots using the Tomcat Session Crawler Manager Valve
+- They don’t request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with //handle, note the regex below):
# grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
@@ -277,7 +277,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
4618
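- The handle regex I mean is something along these lines (a hypothetical reconstruction, since the actual command is cut off above):
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -E "GET //?handle"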
- I just realized that ciat.cgiar.org points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior
-- The next IP (207.46.13.36) seems to be Microsoft's bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:
+- The next IP (207.46.13.36) seems to be Microsoft’s bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:
$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
@@ -328,18 +328,18 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
-I'll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
-While it's not in the top ten, Baidu is one bot that seems to not give a fuck:
+I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
+While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
2521
-- According to their documentation their bot respects robots.txt, but I don't see this being the case
+- According to their documentation their bot respects robots.txt, but I don’t see this being the case
- I think I will end up blocking Baidu as well…
- Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed
-- I should look in nginx access.log, rest.log, oai.log, and DSpace's dspace.log.2017-11-07
+- I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07
- Here are the top IPs making requests to XMLUI from 2 to 8 AM:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
@@ -389,8 +389,8 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
462 ip_addr=104.196.152.243
488 ip_addr=66.249.66.90
-- These aren't actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers
-- The number of requests isn't even that high to be honest
+- These aren’t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers
+- The number of requests isn’t even that high to be honest
- As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:
# zgrep -c 124.17.34.59 /var/log/nginx/access.log*
@@ -405,13 +405,13 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
/var/log/nginx/access.log.8.gz:0
/var/log/nginx/access.log.9.gz:1
-- The whois data shows the IP is from China, but the user agent doesn't really give any clues:
+- The whois data shows the IP is from China, but the user agent doesn’t really give any clues:
# grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
-- A Google search for “LCTE bot” doesn't return anything interesting, but this Stack Overflow discussion references the lack of information
+- A Google search for “LCTE bot” doesn’t return anything interesting, but this Stack Overflow discussion references the lack of information
- So basically after a few hours of looking at the log files I am not closer to understanding what is going on!
- I do know that we want to block Baidu, though, as it does not respect robots.txt
- And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12–14 hours)
@@ -479,13 +479,13 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
20733
-- I'm getting really sick of this
+- I’m getting really sick of this
- Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
- I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
- Run system updates on DSpace Test and reboot the server
- Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (#346)
-- I figured out a way to use nginx's map function to assign a “bot” user agent to misbehaving clients who don't define a user agent
-- Most bots are automatically lumped into one generic session by Tomcat's Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*
+- I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent
+- Most bots are automatically lumped into one generic session by Tomcat’s Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*
- Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
- Basically, we modify the nginx config to add a mapping with a modified user agent $ua:
@@ -495,15 +495,15 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
default $http_user_agent;
}
-- If the client's address matches then the user agent is set, otherwise the default $http_user_agent variable is used
-- Then, in the server's / block we pass this header to Tomcat:
+- If the client’s address matches then the user agent is set, otherwise the default $http_user_agent variable is used
+- Then, in the server’s / block we pass this header to Tomcat:
proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
-- Note to self: the $ua variable won't show up in nginx access logs because the default combined log format doesn't show it, so don't run around pulling your hair out wondering why the modified user agents aren't showing in the logs!
+- Note to self: the $ua variable won’t show up in nginx access logs because the default combined log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!
- If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
-- You can verify by cross referencing nginx's access.log and DSpace's dspace.log.2017-11-08, for example
+- You can verify by cross referencing nginx’s access.log and DSpace’s dspace.log.2017-11-08, for example
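- A rough sketch of that cross referencing, using the CIAT IP from above (the exact log lines will differ, of course):
# grep 104.196.152.243 /var/log/nginx/access.log | tail -n 1
# grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l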
- I will deploy this on CGSpace later this week
- I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on 2017-11-07 for example)
- I merged the clickable thumbnails code to 5_x-prod (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible nginx and tomcat tags)
@@ -522,7 +522,7 @@ proxy_set_header User-Agent $ua;
1134
- I have been looking for a reason to ban Baidu and this is definitely a good one
-- Disallowing Baiduspider in robots.txt probably won't work because this bot doesn't seem to respect the robot exclusion standard anyways!
+- Disallowing Baiduspider in robots.txt probably won’t work because this bot doesn’t seem to respect the robot exclusion standard anyways!
- I will whip up something in nginx later
- Run system updates on CGSpace and reboot the server
- Re-deploy latest 5_x-prod branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)
@@ -548,7 +548,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
3506
- The number of sessions is over ten times less!
-- This gets me thinking, I wonder if I can use something like nginx's rate limiter to automatically change the user agent of clients who make too many requests
+- This gets me thinking, I wonder if I can use something like nginx’s rate limiter to automatically change the user agent of clients who make too many requests
- Perhaps using a combination of geo and map, like illustrated here: https://www.nginx.com/blog/rate-limiting-nginx/
2017-11-11
@@ -560,7 +560,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
2017-11-12
- Update the Ansible infrastructure templates to be a little more modular and flexible
-- Looking at the top client IPs on CGSpace so far this morning, even though it's only been eight hours:
+- Looking at the top client IPs on CGSpace so far this morning, even though it’s only been eight hours:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
243 5.83.120.111
@@ -579,7 +579,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
# grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
-- What's amazing is that it seems to reuse its Java session across all requests:
+- What’s amazing is that it seems to reuse its Java session across all requests:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
1558
@@ -587,7 +587,7 @@ $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | s
1
- Bravo to MegaIndex.ru!
-- The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat's Crawler Session Manager valve regex should match ‘YandexBot’:
+- The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:
# grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
@@ -600,8 +600,8 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
10947/34 10947/1 10568/83389
10947/2512 10947/1 10568/83389
-- I explored nginx rate limits as a way to aggressively throttle Baidu bot which doesn't seem to respect disallowed URLs in robots.txt
-- There's an interesting blog post from Nginx's team about rate limiting as well as a clever use of mapping with rate limits
+- I explored nginx rate limits as a way to aggressively throttle Baidu bot which doesn’t seem to respect disallowed URLs in robots.txt
+- There’s an interesting blog post from Nginx’s team about rate limiting as well as a clever use of mapping with rate limits
- The solution I came up with uses tricks from both of those
- I deployed the limit on CGSpace and DSpace Test and it seems to work well:
@@ -664,7 +664,7 @@ Server: nginx
- Deploy some nginx configuration updates to CGSpace
- They had been waiting on a branch for a few months and I think I just forgot about them
-- I have been running them on DSpace Test for a few days and haven't seen any issues there
+- I have been running them on DSpace Test for a few days and haven’t seen any issues there
- Started testing DSpace 6.2 and a few things have changed
- Now PostgreSQL needs pgcrypto:
@@ -672,21 +672,21 @@ Server: nginx
dspace6=# CREATE EXTENSION pgcrypto;
- Also, local settings are no longer in build.properties, they are now in local.cfg
-- I'm not sure if we can use separate profiles like we did before with mvn -Denv=blah to use blah.properties
+- I’m not sure if we can use separate profiles like we did before with mvn -Denv=blah to use blah.properties
- It seems we need to use “system properties” to override settings, ie: -Ddspace.dir=/Users/aorth/dspace6
2017-11-15
- Send Adam Hunt an invite to the DSpace Developers network on Yammer
- He is the new head of communications at WLE, since Michael left
-- Merge changes to item view's wording of link metadata (#348)
+- Merge changes to item view’s wording of link metadata (#348)
2017-11-17
- Uptime Robot said that CGSpace went down today and I see lots of Timeout waiting for idle object errors in the DSpace logs
- I looked in PostgreSQL using SELECT * FROM pg_stat_activity; and saw that there were 73 active connections
- After a few minutes the connections went down to 44 and CGSpace was kinda back up, it seems like Tsega restarted Tomcat
-- Looking at the REST and XMLUI log files, I don't see anything too crazy:
+- Looking at the REST and XMLUI log files, I don’t see anything too crazy:
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
13 66.249.66.223
@@ -712,7 +712,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
2020 66.249.66.219
- I need to look into using JMX to analyze active sessions I think, rather than looking at log files
-- After adding appropriate JMX listener options to Tomcat's JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:
+- After adding appropriate JMX listener options to Tomcat’s JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:
$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
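- For reference, the SSH dynamic port forward (SOCKS) side of that is just something like this (a sketch; the user and host are examples):
$ ssh -D 7777 -N aorth@cgspace.cgiar.org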
@@ -760,14 +760,14 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
-- It's been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:
+- It’s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:

2017-11-20
- I found an article about JVM tuning that gives some pointers how to enable logging and tools to analyze logs for you
- Also notes on rotating GC logs
-- I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven't even tried to monitor or tune it
+- I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven’t even tried to monitor or tune it
2017-11-21
@@ -777,7 +777,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
2017-11-22
- Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM
-- The logs don't show anything particularly abnormal between those hours:
+- The logs don’t show anything particularly abnormal between those hours:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
136 31.6.77.23
@@ -791,7 +791,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
696 66.249.66.90
707 104.196.152.243
-- I haven't seen 54.144.57.183 before, it is apparently the CCBot from commoncrawl.org
+- I haven’t seen 54.144.57.183 before, it is apparently the CCBot from commoncrawl.org
- In other news, it looks like the JVM garbage collection pattern is back to its standard jigsaw pattern after switching back to CMS a few days ago:

@@ -826,22 +826,22 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
942 45.5.184.196
3995 70.32.83.92
-- These IPs crawling the REST API don't specify user agents and I'd assume they are creating many Tomcat sessions
+- These IPs crawling the REST API don’t specify user agents and I’d assume they are creating many Tomcat sessions
- I would catch them in nginx to assign a “bot” user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don’t seem to create any really — at least not in the dspace.log:
$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
-- I'm wondering if REST works differently, or just doesn't log these sessions?
+- I’m wondering if REST works differently, or just doesn’t log these sessions?
- I wonder if they are measurable via JMX MBeans?
-- I did some tests locally and I don't see the sessionCounter incrementing after making requests to REST, but it does with XMLUI and OAI
-- I came across some interesting PostgreSQL tuning advice for SSDs: https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0
+- I did some tests locally and I don’t see the sessionCounter incrementing after making requests to REST, but it does with XMLUI and OAI
+- I came across some interesting PostgreSQL tuning advice for SSDs: https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0
- Apparently setting random_page_cost to 1 is “common” advice for systems running PostgreSQL on SSD (the default is 4)
- So I deployed this on DSpace Test and will check the Munin PostgreSQL graphs in a few days to see if anything changes
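- For reference, one way to apply that tweak (a sketch using ALTER SYSTEM, assuming PostgreSQL 9.4 or newer; it could just as well be set in postgresql.conf):
$ sudo -u postgres psql -c "ALTER SYSTEM SET random_page_cost = 1;"
$ sudo -u postgres psql -c "SELECT pg_reload_conf();"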
2017-11-24
-- It's too early to tell for sure, but after I made the random_page_cost change on DSpace Test's PostgreSQL yesterday the number of connections dropped drastically:
+- It’s too early to tell for sure, but after I made the random_page_cost change on DSpace Test’s PostgreSQL yesterday the number of connections dropped drastically:

@@ -849,8 +849,8 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19

-- I just realized that we're not logging access requests to other vhosts on CGSpace, so it's possible I have no idea that we're getting slammed at 4AM on another domain that we're just silently redirecting to cgspace.cgiar.org
-- I've enabled logging on the CGIAR Library on CGSpace so I can check to see if there are many requests there
+- I just realized that we’re not logging access requests to other vhosts on CGSpace, so it’s possible I have no idea that we’re getting slammed at 4AM on another domain that we’re just silently redirecting to cgspace.cgiar.org
+- I’ve enabled logging on the CGIAR Library on CGSpace so I can check to see if there are many requests there
- In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)
- I also noticed that CGNET appears to be monitoring the old domain every few minutes:
@@ -893,29 +893,29 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
6053 45.5.184.196
- PostgreSQL activity shows 69 connections
-- I don't have time to troubleshoot more as I'm in Nairobi working on the HPC so I just restarted Tomcat for now
+- I don’t have time to troubleshoot more as I’m in Nairobi working on the HPC so I just restarted Tomcat for now
- A few hours later Uptime Robot says the server is down again
-- I don't see much activity in the logs but there are 87 PostgreSQL connections
+- I don’t see much activity in the logs but there are 87 PostgreSQL connections
- But shit, there were 10,000 unique Tomcat sessions today:
$ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
10037
-- Although maybe that's not much, as the previous two days had more:
+- Although maybe that’s not much, as the previous two days had more:
$ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
12377
$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
16984
-- I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it's the most common source of crashes we have
-- I will bump DSpace's db.maxconnections from 60 to 90, and PostgreSQL's max_connections from 183 to 273 (which is using my loose formula of 90 * webapps + 3)
+- I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it’s the most common source of crashes we have
+- I will bump DSpace’s db.maxconnections from 60 to 90, and PostgreSQL’s max_connections from 183 to 273 (which is using my loose formula of 90 * webapps + 3)
- I really need to figure out how to get DSpace to use a PostgreSQL connection pool
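- Quick sanity check on that formula, assuming the three webapps implied by 273:
$ echo $((90 * 3 + 3))
273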
2017-11-30
- Linode alerted about high CPU usage on CGSpace again around 6 to 8 AM
-- Then Uptime Robot said CGSpace was down a few minutes later, but it resolved itself I think (or Tsega restarted Tomcat, I don't know)
+- Then Uptime Robot said CGSpace was down a few minutes later, but it resolved itself I think (or Tsega restarted Tomcat, I don’t know)
diff --git a/docs/2017-12/index.html b/docs/2017-12/index.html
index 87089c0d1..9931eb6a7 100644
--- a/docs/2017-12/index.html
+++ b/docs/2017-12/index.html
@@ -27,7 +27,7 @@ The logs say “Timeout waiting for idle object”
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
-
+
@@ -57,7 +57,7 @@ The list of connections to XMLUI and REST API for today:
@@ -104,7 +104,7 @@ The list of connections to XMLUI and REST API for today:
December, 2017
@@ -128,7 +128,7 @@ The list of connections to XMLUI and REST API for today:
4007 70.32.83.92
6061 45.5.184.196
-- The number of DSpace sessions isn't even that high:
+- The number of DSpace sessions isn’t even that high:
$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
5815
@@ -148,7 +148,7 @@ The list of connections to XMLUI and REST API for today:
314 2.86.122.76
- What the fuck is going on?
-- I've never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:
+- I’ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:
$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
822
@@ -169,20 +169,20 @@ The list of connections to XMLUI and REST API for today:
319 2001:4b99:1:1:216:3eff:fe76:205b
2017-12-03
-- Linode alerted that CGSpace's load was 327.5% from 6 to 8 AM again
+- Linode alerted that CGSpace’s load was 327.5% from 6 to 8 AM again
2017-12-04
-- Linode alerted that CGSpace's load was 255.5% from 8 to 10 AM again
+- Linode alerted that CGSpace’s load was 255.5% from 8 to 10 AM again
- I looked at the Munin stats on DSpace Test (linode02) again to see how the PostgreSQL tweaks from a few weeks ago were holding up:

-- The results look fantastic! So the random_page_cost tweak is massively important for informing the PostgreSQL scheduler that there is no “cost” to accessing random pages, as we're on an SSD!
+- The results look fantastic! So the random_page_cost tweak is massively important for informing the PostgreSQL scheduler that there is no “cost” to accessing random pages, as we’re on an SSD!
- I guess we could probably even reduce the PostgreSQL connections in DSpace / PostgreSQL after using this
- Run system updates on DSpace Test (linode02) and reboot it
-- I'm going to enable the PostgreSQL random_page_cost tweak on CGSpace
-- For reference, here is the past month's connections:
+- I’m going to enable the PostgreSQL random_page_cost tweak on CGSpace
+- For reference, here is the past month’s connections:

2017-12-05
@@ -196,8 +196,8 @@ The list of connections to XMLUI and REST API for today:
Linode alerted again that the CPU usage on CGSpace was high this morning from 6 to 8 AM
Uptime Robot alerted that the server went down and up around 8:53 this morning
Uptime Robot alerted that CGSpace was down and up again a few minutes later
-I don't see any errors in the DSpace logs but I see in nginx's access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
-Looking at the REST API logs I see some new client IP I haven't noticed before:
+I don’t see any errors in the DSpace logs but I see in nginx’s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
+Looking at the REST API logs I see some new client IP I haven’t noticed before:
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
18 95.108.181.88
@@ -233,7 +233,7 @@ The list of connections to XMLUI and REST API for today:
2662 66.249.66.219
5110 124.17.34.60
-- We've never seen 124.17.34.60 yet, but it's really hammering us!
+- We’ve never seen 124.17.34.60 yet, but it’s really hammering us!
- Apparently it is from China, and here is one of its user agents:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
@@ -243,7 +243,7 @@ The list of connections to XMLUI and REST API for today:
$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
4574
-- I've adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it's the same bot on the same subnet
+- I’ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it’s the same bot on the same subnet
- I was running the DSpace cleanup task manually and it hit an error:
$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
@@ -261,7 +261,7 @@ UPDATE 1
2017-12-16
-- Re-work the XMLUI base theme to allow child themes to override the header logo's image and link destination: #349
+- Re-work the XMLUI base theme to allow child themes to override the header logo’s image and link destination: #349
- This required a little bit of work to restructure the XSL templates
- Optimize PNG and SVG image assets in the CGIAR base theme using pngquant and svgo: #350
@@ -276,7 +276,7 @@ UPDATE 1
I also had to add the .jpg to the thumbnail string in the CSV
The thumbnail11.jpg is missing
The dates are in super long ISO8601 format (from Excel?) like 2016-02-07T00:00:00Z so I converted them to simpler forms in GREL: value.toString("yyyy-MM-dd")
-I trimmed the whitespaces in a few fields but it wasn't many
+I trimmed the whitespaces in a few fields but it wasn’t many
Rename her thumbnail column to filename, and format it so SAFBuilder adds the files to the thumbnail bundle with this GREL in OpenRefine: value + "__bundle:THUMBNAIL"
Rename dc.identifier.status and dc.identifier.url columns to cg.identifier.status and cg.identifier.url
Item 4 has weird characters in citation, ie: Nagoya et de Trait
@@ -289,7 +289,7 @@ UPDATE 1
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
-- It's the same on DSpace Test, I can't import the SAF bundle without specifying the collection:
+- It’s the same on DSpace Test, I can’t import the SAF bundle without specifying the collection:
$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming 'collections' file inside item directory
@@ -317,7 +317,7 @@ Elapsed time: 2 secs (2559 msecs)
-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
- … but the error message was the same, just with more INFO noise around it
-- For now I'll import into a collection in DSpace Test but I'm really not sure what's up with this!
+- For now I’ll import into a collection in DSpace Test but I’m really not sure what’s up with this!
- Linode alerted that CGSpace was using high CPU from 4 to 6 PM
- The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:
@@ -347,7 +347,7 @@ Elapsed time: 2 secs (2559 msecs)
4014 70.32.83.92
11030 45.5.184.196
-- That's probably ok, as I don't think the REST API connections use up a Tomcat session…
+- That’s probably ok, as I don’t think the REST API connections use up a Tomcat session…
- CIP emailed a few days ago to ask about unique IDs for authors and organizations, and if we can provide them via an API
- Regarding the import issue above it seems to be a known issue that has a patch in DSpace 5.7:
@@ -355,7 +355,7 @@ Elapsed time: 2 secs (2559 msecs)
- https://jira.duraspace.org/browse/DS-3583
-- We're on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731
+- We’re on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731
- I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase on to the latest 5.x upstream later
- Pull request: #351
@@ -393,7 +393,7 @@ Elapsed time: 2 secs (2559 msecs)
I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
Update text on CGSpace about page to give some tips to developers about using the resources more wisely (#352)
Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM
-The REST and OAI API logs look pretty much the same as earlier this morning, but there's a new IP harvesting XMLUI:
+The REST and OAI API logs look pretty much the same as earlier this morning, but there’s a new IP harvesting XMLUI:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
360 95.108.181.88
@@ -416,8 +416,8 @@ Elapsed time: 2 secs (2559 msecs)
$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
-- I guess there's nothing I can do to them for now
-- In other news, I am curious how many PostgreSQL connection pool errors we've had in the last month:
+- I guess there’s nothing I can do to them for now
+- In other news, I am curious how many PostgreSQL connection pool errors we’ve had in the last month:
$ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
dspace.log.2017-11-07:15695
@@ -430,9 +430,9 @@ dspace.log.2017-12-01:1601
dspace.log.2017-12-02:1274
dspace.log.2017-12-07:2769
-- I made a small fix to my move-collections.sh script so that it handles the case when a “to” or “from” community doesn't exist
+- I made a small fix to my move-collections.sh script so that it handles the case when a “to” or “from” community doesn’t exist
- The script lives here: https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515
-- Major reorganization of four of CTA's French collections
+- Major reorganization of four of CTA’s French collections
- Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities
- Move collection 10568/51821 from 10568/42212 to 10568/42211
- Move collection 10568/51400 from 10568/42214 to 10568/42211
@@ -457,21 +457,21 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
2017-12-19
- Briefly had PostgreSQL connection issues on CGSpace for the millionth time
-- I'm fucking sick of this!
+- I’m fucking sick of this!
- The connection graph on CGSpace shows shit tons of connections idle

-- And I only now just realized that DSpace's db.maxidle parameter is not seconds, but number of idle connections to allow.
+- And I only now just realized that DSpace’s db.maxidle parameter is not seconds, but number of idle connections to allow.
- So theoretically, because each webapp has its own pool, this could be 20 per app—so no wonder we have 50 idle connections!
- I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: https://jira.duraspace.org/browse/DS-3564
-- So I'm going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI
+- So I’m going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI
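- A quick way to count the idle connections mentioned above (a sketch):
$ sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'idle';"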
- I re-deployed the 5_x-prod branch on CGSpace, applied all system updates, and restarted the server
- Looking through the dspace.log I see this error:
2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
-- I don't have time now to look into this but the Solr sharding has long been an issue!
+- I don’t have time now to look into this but the Solr sharding has long been an issue!
- Looking into using JDBC / JNDI to provide a database pool to DSpace
- The DSpace 6.x configuration docs have more notes about setting up the database pool than the 5.x ones (which actually have none!)
- First, I uncomment db.jndi in dspace/config/dspace.cfg
@@ -496,7 +496,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
- I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc…
-- When DSpace can't find the JNDI context (for whatever reason) you will see this in the dspace logs:
+- When DSpace can’t find the JNDI context (for whatever reason) you will see this in the dspace logs:
2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
@@ -547,31 +547,31 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
<version>9.1-901-1.jdbc4</version>
</dependency>
-- So WTF? Let's try copying one to Tomcat's lib folder and restarting Tomcat:
+- So WTF? Let’s try copying one to Tomcat’s lib folder and restarting Tomcat:
$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
-- Oh that's fantastic, now at least Tomcat doesn't print an error during startup so I guess it succeeds to create the JNDI pool
-- DSpace starts up but I have no idea if it's using the JNDI configuration because I see this in the logs:
+- Oh that’s fantastic, now at least Tomcat doesn’t print an error during startup so I guess it succeeds to create the JNDI pool
+- DSpace starts up but I have no idea if it’s using the JNDI configuration because I see this in the logs:
2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
2017-12-19 13:26:54,293 INFO org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2017-12-19 13:26:54,306 INFO org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
-- Let's try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts…
-- Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat's lib folder totally blows
-- Also, it's likely this is only a problem on my local macOS + Tomcat test environment
-- Ubuntu's Tomcat distribution will probably handle this differently
+- Let’s try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts…
+- Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat’s lib folder totally blows
+- Also, it’s likely this is only a problem on my local macOS + Tomcat test environment
+- Ubuntu’s Tomcat distribution will probably handle this differently
- So for reference I have:
- a <Resource> defined globally in server.xml
-- a <ResourceLink> defined in each web application's context XML
+- a <ResourceLink> defined in each web application’s context XML
- unset the db.url, db.username, and db.password parameters in dspace.cfg
- set the db.jndi in dspace.cfg to the name specified in the web application context
-- After adding the Resource to server.xml on Ubuntu I get this in Catalina's logs:
+- After adding the Resource to server.xml on Ubuntu I get this in Catalina’s logs:
SEVERE: Unable to create initial connections of pool.
java.sql.SQLException: org.postgresql.Driver
@@ -579,8 +579,8 @@ java.sql.SQLException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
- The username and password are correct, but maybe I need to copy the fucking lib there too?
-- I tried installing Ubuntu's libpostgresql-jdbc-java package but Tomcat still can't find the class
-- Let me try to symlink the lib into Tomcat's libs:
+- I tried installing Ubuntu’s libpostgresql-jdbc-java package but Tomcat still can’t find the class
+- Let me try to symlink the lib into Tomcat’s libs:
# ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
@@ -589,17 +589,17 @@ Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
-- Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace's are 9.1…
-- Let me try to remove it and copy in DSpace's:
+- Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace’s are 9.1…
+- Let me try to remove it and copy in DSpace’s:
# rm /usr/share/tomcat7/lib/postgresql.jar
# cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
- Wow, I think that actually works…
- I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: https://jdbc.postgresql.org/
-- I notice our version is 9.1-901, which isn't even available anymore! The latest in the archived versions is 9.1-903
+- I notice our version is 9.1-901, which isn’t even available anymore! The latest in the archived versions is 9.1-903
- Also, since I commented out all the db parameters in DSpace.cfg, how does the command line dspace tool work?
-- Let's try the upstream JDBC driver first:
+- Let’s try the upstream JDBC driver first:
# rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
# wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
@@ -648,8 +648,8 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
- If I add the db values back to dspace.cfg the dspace database info command succeeds but the log still shows errors retrieving the JNDI connection
- Perhaps something to report to the dspace-tech mailing list when I finally send my comments
-- Oh cool! select * from pg_stat_activity shows “PostgreSQL JDBC Driver” for the application name! That's how you know it's working!
-- If you monitor the pg_stat_activity while you run dspace database info you can see that it doesn't use the JNDI and creates ~9 extra PostgreSQL connections!
+- Oh cool! select * from pg_stat_activity shows “PostgreSQL JDBC Driver” for the application name! That’s how you know it’s working!
+- If you monitor the pg_stat_activity while you run dspace database info you can see that it doesn’t use the JNDI and creates ~9 extra PostgreSQL connections!
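- The check itself is roughly this (a sketch of grouping connections by application name, as mentioned above):
$ sudo -u postgres psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;"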
- And in the middle of all of this Linode sends an alert that CGSpace has high CPU usage from 2 to 4 PM
2017-12-20
@@ -678,14 +678,14 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
2017-12-24
- Linode alerted that CGSpace was using high CPU this morning around 6 AM
-- I'm playing with reading all of a month's nginx logs into goaccess:
+- I’m playing with reading all of a month’s nginx logs into goaccess:
# find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
- I can see interesting things using this approach, for example:
-- 50.116.102.77 checked our status almost 40,000 times so far this month—I think it's the CGNet uptime tool
-- Also, we've handled 2.9 million requests this month from 172,000 unique IP addresses!
+- 50.116.102.77 checked our status almost 40,000 times so far this month—I think it’s the CGNet uptime tool
+- Also, we’ve handled 2.9 million requests this month from 172,000 unique IP addresses!
- Total bandwidth so far this month is 640GiB
- The user that made the most requests so far this month is 45.5.184.196 (267,000 requests)
@@ -720,13 +720,13 @@ UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
DELETE 20
-- I need to figure out why we have records with language “in” because that's not a language!
+- I need to figure out why we have records with language “in” because that’s not a language!
2017-12-30
- Linode alerted that CGSpace was using 259% CPU from 4 to 6 AM
- Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM
-- Here's the XMLUI logs:
+- Here’s the XMLUI logs:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
637 207.46.13.106
@@ -740,14 +740,14 @@ DELETE 20
1586 66.249.64.78
3653 66.249.64.91
-- Looks pretty normal actually, but I don't know who 54.175.208.220 is
+- Looks pretty normal actually, but I don’t know who 54.175.208.220 is
- They identify as “com.plumanalytics”, which Google says is associated with Elsevier
-- They only seem to have used one Tomcat session so that's good, I guess I don't need to add them to the Tomcat Crawler Session Manager valve:
+- They only seem to have used one Tomcat session so that’s good, I guess I don’t need to add them to the Tomcat Crawler Session Manager valve:
$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
-- 216.244.66.245 seems to be moz.com's DotBot
+- 216.244.66.245 seems to be moz.com’s DotBot
2017-12-31
diff --git a/docs/2018-01/index.html b/docs/2018-01/index.html
index bf99a7005..6fcdbd4d6 100644
--- a/docs/2018-01/index.html
+++ b/docs/2018-01/index.html
@@ -9,7 +9,7 @@
@@ -83,7 +83,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
@@ -177,7 +177,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
@@ -224,7 +224,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
January, 2018
@@ -232,7 +232,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
-- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
+- I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
- The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
- And just before that I see this:
@@ -240,8 +240,8 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
- Ah hah! So the pool was actually empty!
-- I need to increase that, let's try to bump it up from 50 to 75
-- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
+- I need to increase that, let’s try to bump it up from 50 to 75
+- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
- I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -294,7 +294,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
-- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
+- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
2018-01-03
@@ -326,8 +326,8 @@ dspace.log.2018-01-03:1909
- 134.155.96.78 appears to be at the University of Mannheim in Germany
- They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://ifm.uni-mannheim.de)
-- This appears to be the Internet Archive's open source bot
-- They seem to be re-using their Tomcat session so I don't need to do anything to them just yet:
+- This appears to be the Internet Archive’s open source bot
+- They seem to be re-using their Tomcat session so I don’t need to do anything to them just yet:
$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
@@ -387,8 +387,8 @@ dspace.log.2018-01-03:1909
139 164.39.7.62
- I have no idea what these are but they seem to be coming from Amazon…
-- I guess for now I just have to increase the database connection pool's max active
-- It's currently 75 and normally I'd just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling
+- I guess for now I just have to increase the database connection pool’s max active
+- It’s currently 75 and normally I’d just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling
2018-01-04
@@ -420,14 +420,14 @@ dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
-- I will just bump the connection limit to 300 because I'm fucking fed up with this shit
+- I will just bump the connection limit to 300 because I’m fucking fed up with this shit
- Once I get back to Amman I will have to try to create different database pools for different web applications, like recently discussed on the dspace-tech mailing list
- Create accounts on CGSpace for two CTA staff km4ard@cta.int and bheenick@cta.int
2018-01-05
- Peter said that CGSpace was down last night and Tsega restarted Tomcat
-- I don't see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:
+- I don’t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
@@ -442,8 +442,8 @@ dspace.log.2018-01-05:0
[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
- I will delete the log file for now and tell Danny
-- Also, I'm still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is
-- I will run a full Discovery reindex in the mean time to see if it's something wrong with the Discovery Solr core
+- Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is
+- I will run a full Discovery reindex in the mean time to see if it’s something wrong with the Discovery Solr core
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
@@ -456,7 +456,7 @@ sys 3m14.890s
2018-01-06
-- I'm still seeing Solr errors in the DSpace logs even after the full reindex yesterday:
+- I’m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
@@ -471,7 +471,7 @@ sys 3m14.890s
COPY 4515
2018-01-10
-- I looked to see what happened to this year's Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:
+- I looked to see what happened to this year’s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:
Moving: 81742 into core statistics-2010
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
@@ -542,9 +542,9 @@ Caused by: org.apache.http.client.ClientProtocolException
... 10 more
- There is interesting documentation about this on the DSpace Wiki: https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear
-- I'm looking to see maybe if we're hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2
+- I’m looking to see maybe if we’re hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2
- I can apparently search for records in the Solr stats core that have an empty
owningColl
field using this in the Solr admin query: -owningColl:*
-- On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don't:
+- On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don’t:
$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound
"response":{"numFound":48476327,"start":0,"docs":[
@@ -552,14 +552,14 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
"response":{"numFound":34879872,"start":0,"docs":[
- I tested the dspace stats-util -s process on my local machine and it failed the same way
-- It doesn't seem to be helpful, but the dspace log shows this:
+- It doesn’t seem to be helpful, but the dspace log shows this:
2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
- Terry Brady has written some notes on the DSpace Wiki about Solr sharing issues: https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues
- Uptime Robot said that CGSpace went down at around 9:43 AM
-- I looked at PostgreSQL's pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:
+- I looked at PostgreSQL’s pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-10
0
@@ -583,7 +583,7 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
whois says they come from Perfect IP
-- I've never seen those top IPs before, but they have created 50,000 Tomcat sessions today:
+- I’ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:
$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
49096
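- For reference, a per-IP breakdown can be had by reusing the same log path and session_id pattern as above (a rough sketch, with the IP list abbreviated):

```
$ for ip in 70.36.107.49 70.36.107.190 70.36.107.50; do
    echo -n "$ip: ";
    grep "ip_addr=$ip" /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -u | wc -l;
  done
```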
@@ -599,20 +599,20 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
23401 2607:fa98:40:9:26b6:fdff:feff:195d
47875 2607:fa98:40:9:26b6:fdff:feff:1888
-- I added the user agent to nginx's badbots limit req zone but upon testing the config I got an error:
+- I added the user agent to nginx’s badbots limit req zone but upon testing the config I got an error:
# nginx -t
nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
nginx: configuration file /etc/nginx/nginx.conf test failed
-- According to nginx docs the bucket size should be a multiple of the CPU's cache alignment, which is 64 for us:
+- According to nginx docs the bucket size should be a multiple of the CPU’s cache alignment, which is 64 for us:
# cat /proc/cpuinfo | grep cache_alignment | head -n1
cache_alignment : 64
- On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx
- Almost immediately the PostgreSQL connections dropped back down to 40 or so, and UptimeRobot said the site was back up
-- So that's interesting that we're not out of PostgreSQL connections (current pool maxActive is 300!) but the system is “down” to UptimeRobot and very slow to use
+- So that’s interesting that we’re not out of PostgreSQL connections (current pool maxActive is 300!) but the system is “down” to UptimeRobot and very slow to use
- Linode continues to test mitigations for Meltdown and Spectre: https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/
- I rebooted DSpace Test to see if the kernel will be updated (currently Linux 4.14.12-x86_64-linode92)… nope.
- It looks like Linode will reboot the KVM hosts later this week, though
@@ -650,7 +650,7 @@ cache_alignment : 64
111535 2607:fa98:40:9:26b6:fdff:feff:1c96
161797 2607:fa98:40:9:26b6:fdff:feff:1888
-- Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat's server.xml:
+- Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat’s server.xml:
<Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
driverClassName="org.postgresql.Driver"
@@ -665,9 +665,9 @@ cache_alignment : 64
validationQuery='SELECT 1'
testOnBorrow='true' />
-- So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL's pg_stat_activity table!
+- So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL’s pg_stat_activity table!
- This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)
-- Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your application's context—not the global one
+- Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your application’s context—not the global one
- Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:
db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
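- To verify which pool each connection comes from once the ApplicationName is set, something like this should work (a rough sketch using standard pg_stat_activity columns):

```
$ psql -c "SELECT application_name, state, count(*) AS n FROM pg_stat_activity GROUP BY 1, 2 ORDER BY n DESC;"
```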
@@ -676,7 +676,7 @@ cache_alignment : 64
2018-01-12
-- I'm looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:
+- I’m looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:
<!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
<Connector port="8080"
@@ -691,8 +691,8 @@ cache_alignment : 64
URIEncoding="UTF-8"/>
- In Tomcat 8.5 the maxThreads defaults to 200 which is probably fine, but tweaking minSpareThreads could be good
-- I don't see a setting for maxSpareThreads in the docs so that might be an error
-- Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don't need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
+- I don’t see a setting for maxSpareThreads in the docs so that might be an error
+- Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don’t need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
- Ooh, I just saw the acceptorThreadCount setting (in Tomcat 7 and 8.5):
The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
@@ -707,7 +707,7 @@ cache_alignment : 64
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
-- I looked in my Tomcat 7.0.82 logs and I don't see anything about DBCP2 errors, so I guess this a Tomcat 8.0.x or 8.5.x thing
+- I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this a Tomcat 8.0.x or 8.5.x thing
- DBCP2 appears to be Tomcat 8.0.x and up according to the Tomcat 8.0 migration guide
- I have updated our Ansible infrastructure scripts so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)
- When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:
@@ -735,24 +735,24 @@ Caused by: java.lang.NullPointerException
... 15 more
- Interesting blog post benchmarking Tomcat JDBC vs Apache Commons DBCP2, with configuration snippets: http://www.tugay.biz/2016/07/tomcat-connection-pool-vs-apache.html
-- The Tomcat vs Apache pool thing is confusing, but apparently we're using Apache Commons DBCP2 because we don't specify factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in our global resource
-- So at least I know that I'm not looking for documentation or troubleshooting on the Tomcat JDBC pool!
-- I looked at pg_stat_activity during Tomcat's startup and I see that the pool created in server.xml is indeed connecting, just that nothing uses it
+- The Tomcat vs Apache pool thing is confusing, but apparently we’re using Apache Commons DBCP2 because we don’t specify factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in our global resource
+- So at least I know that I’m not looking for documentation or troubleshooting on the Tomcat JDBC pool!
+- I looked at pg_stat_activity during Tomcat’s startup and I see that the pool created in server.xml is indeed connecting, just that nothing uses it
- Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
- Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
-- I'll comment on that issue
+- I’ll comment on that issue
2018-01-14
- Looking at the authors Peter had corrected
-- Some had multiple and he's corrected them by adding || in the correction column, but I can't process those this way so I will just have to flag them and do those manually later
+- Some had multiple and he’s corrected them by adding || in the correction column, but I can’t process those this way so I will just have to flag them and do those manually later
- Also, I can flag the values that have “DELETE”
- Then I need to facet the correction column on isBlank(value) and not flagged
2018-01-15
- Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
-- I'm going to apply these ~130 corrections on CGSpace:
+- I’m going to apply these ~130 corrections on CGSpace:
update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
@@ -764,7 +764,7 @@ update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_f
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
-- Continue proofing Peter's author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names
+- Continue proofing Peter’s author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names

@@ -817,9 +817,9 @@ COPY 4552
- Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
- For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
- Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
-- So some submitters don't know to use the controlled vocabulary lookup
+- So some submitters don’t know to use the controlled vocabulary lookup
- Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder
-- CGSpace users were having problems logging in, I think something's wrong with LDAP because I see this in the logs:
+- CGSpace users were having problems logging in, I think something’s wrong with LDAP because I see this in the logs:
2018-01-15 12:53:15,810 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
@@ -835,7 +835,7 @@ sys 0m2.210s
- Meeting with CGSpace team, a few action items:
-- Discuss standardized names for CRPs and centers with ICARDA (don't wait for CG Core)
+- Discuss standardized names for CRPs and centers with ICARDA (don’t wait for CG Core)
- Re-send DC rights implementation and forward to everyone so we can move forward with it (without the URI field for now)
- Start looking at where I was with the AGROVOC API
- Have a controlled vocabulary for CGIAR authors’ names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)
@@ -845,15 +845,15 @@ sys 0m2.210s
- Add Sisay and Danny to Uptime Robot and allow them to restart Tomcat on CGSpace ✔
-- I removed Tsega's SSH access to the web and DSpace servers, and asked Danny to check whether there is anything he needs from Tsega's home directories so we can delete the accounts completely
-- I removed Tsega's access to Linode dashboard as well
+- I removed Tsega’s SSH access to the web and DSpace servers, and asked Danny to check whether there is anything he needs from Tsega’s home directories so we can delete the accounts completely
+- I removed Tsega’s access to Linode dashboard as well
- I ended up creating a Jira issue for my db.jndi documentation fix: DS-3803
- The DSpace developers said they wanted each pull request to be associated with a Jira issue
2018-01-17
- Abenet asked me to proof and upload 54 records for LIVES
-- A few records were missing countries (even though they're all from Ethiopia)
+- A few records were missing countries (even though they’re all from Ethiopia)
- Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses
- In any case, importing them like this:
@@ -862,7 +862,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
- And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload
- When I looked there were 210 PostgreSQL connections!
-- I don't see any high load in XMLUI or REST/OAI:
+- I don’t see any high load in XMLUI or REST/OAI:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
381 40.77.167.124
@@ -892,8 +892,8 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8081: The target server failed to respond
2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
-- I have NEVER seen this error before, and there is no error before or after that in DSpace's solr.log
-- Tomcat's catalina.out does show something interesting, though, right at that time:
+- I have NEVER seen this error before, and there is no error before or after that in DSpace’s solr.log
+- Tomcat’s catalina.out does show something interesting, though, right at that time:
[====================> ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
[====================> ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
@@ -933,7 +933,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOf
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
-- You can see the timestamp above, which is some Atmire nightly task I think, but I can't figure out which one
+- You can see the timestamp above, which is some Atmire nightly task I think, but I can’t figure out which one
- So I restarted Tomcat and tried the import again, which finished very quickly and without errors!
$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &> lives2.log
@@ -942,7 +942,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOf

-- I'm playing with maven repository caching using Artifactory in a Docker instance: https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker
+- I’m playing with maven repository caching using Artifactory in a Docker instance: https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker
$ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
$ docker volume create --name artifactory5_data
@@ -961,10 +961,10 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
- UptimeRobot said CGSpace went down for a few minutes
-- I didn't do anything but it came back up on its own
-- I don't see anything unusual in the XMLUI or REST/OAI logs
+- I didn’t do anything but it came back up on its own
+- I don’t see anything unusual in the XMLUI or REST/OAI logs
- Now Linode alert says the CPU load is high, sigh
-- Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I'm not sure how far these logs go back, as they are not strictly daily):
+- Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I’m not sure how far these logs go back, as they are not strictly daily):
# zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
/var/log/tomcat7/catalina.out:2
@@ -994,14 +994,14 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
2018-01-18
- UptimeRobot said CGSpace was down for 1 minute last night
-- I don't see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499
+- I don’t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499
- I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
- Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection
-- It's easy enough to do in OpenRefine, but you have to be careful to only get those items that are uploaded into Bioversity's collection, not the ones that are mapped from others!
+- It’s easy enough to do in OpenRefine, but you have to be careful to only get those items that are uploaded into Bioversity’s collection, not the ones that are mapped from others!
- Use this GREL in OpenRefine after isolating all the Limited Access items:
value.startsWith("10568/35501")
- UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me or each other!
@@ -1011,8 +1011,8 @@ Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for user root by swebshet(uid=0)
-- I had to cancel the Discovery indexing and I'll have to re-try it another time when the server isn't so busy (it had already taken two hours and wasn't even close to being done)
-- For now I've increased the Tomcat JVM heap from 5632 to 6144m, to give ~1GB of free memory over the average usage to hopefully account for spikes caused by load or background jobs
+- I had to cancel the Discovery indexing and I’ll have to re-try it another time when the server isn’t so busy (it had already taken two hours and wasn’t even close to being done)
+- For now I’ve increased the Tomcat JVM heap from 5632 to 6144m, to give ~1GB of free memory over the average usage to hopefully account for spikes caused by load or background jobs
2018-01-19
@@ -1023,8 +1023,8 @@ Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
- Linode alerted again and said that CGSpace was using 301% CPU
-- Peter emailed to ask why this item doesn't have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard
-- Looks like our badge code calls the handle endpoint which doesn't exist:
+- Peter emailed to ask why this item doesn’t have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard
+- Looks like our badge code calls the handle endpoint which doesn’t exist:
https://api.altmetric.com/v1/handle/10568/88090
@@ -1060,7 +1060,7 @@ real 7m2.241s
user 1m33.198s
sys 0m12.317s
-- I tested the abstract cleanups on Bioversity's Journal Articles collection again that I had started a few days ago
+- I tested the abstract cleanups on Bioversity’s Journal Articles collection again that I had started a few days ago
- In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts
- I want to document the workflow of adding a production PostgreSQL database to a development instance of DSpace in Docker:
@@ -1075,7 +1075,7 @@ $ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:
$ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
2018-01-22
-- Look over Udana's CSV of 25 WLE records from last week
+- Look over Udana’s CSV of 25 WLE records from last week
- I sent him some corrections:
- The file encoding is Windows-1252
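- Something like this should take care of the encoding (a rough sketch; the filename is hypothetical):

```
$ file -i 2018-01-22-wle-records.csv
$ iconv -f WINDOWS-1252 -t UTF-8 2018-01-22-wle-records.csv -o 2018-01-22-wle-records-utf8.csv
```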
@@ -1090,7 +1090,7 @@ $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
- I wrote a quick Python script to use the DSpace REST API to find all collections under a given community
- The source code is here: rest-find-collections.py
-- Peter had said that he found a bunch of ILRI collections that were called “untitled”, but I don't see any:
+- Peter had said that he found a bunch of ILRI collections that were called “untitled”, but I don’t see any:
$ ./rest-find-collections.py 10568/1 | wc -l
308
@@ -1099,17 +1099,17 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
Looking at the Tomcat connector docs I think we really need to increase maxThreads
The default is 200, which can easily be taken up by bots considering that Google and Bing each browse with fifty (50) connections each sometimes!
Before I increase this I want to see if I can measure and graph this, and then benchmark
-I'll probably also increase minSpareThreads to 20 (its default is 10)
+I’ll probably also increase minSpareThreads to 20 (its default is 10)
I still want to bump up acceptorThreadCount from 1 to 2 as well, as the documentation says this should be increased on multi-core systems
I spent quite a bit of time looking at jvisualvm and jconsole today
Run system updates on DSpace Test and reboot it
I see I can monitor the number of Tomcat threads and some detailed JVM memory stuff if I install munin-plugins-java
-I'd still like to get arbitrary mbeans like activeSessions etc, though
-I can't remember if I had to configure the jmx settings in /etc/munin/plugin-conf.d/munin-node or not—I think all I did was re-run the munin-node-configure script and of course enable JMX in Tomcat's JVM options
+I’d still like to get arbitrary mbeans like activeSessions etc, though
+I can’t remember if I had to configure the jmx settings in /etc/munin/plugin-conf.d/munin-node or not—I think all I did was re-run the munin-node-configure script and of course enable JMX in Tomcat’s JVM options
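The Munin side of that is roughly the following (a sketch based on standard Ubuntu packaging, with JMX already enabled in Tomcat's JVM options):

```
# apt-get install munin-plugins-java
# munin-node-configure --shell | sh
# service munin-node restart
```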
2018-01-23
-- Thinking about generating a jmeter test plan for DSpace, along the lines of Georgetown's dspace-performance-test
+- Thinking about generating a jmeter test plan for DSpace, along the lines of Georgetown’s dspace-performance-test
- I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
@@ -1208,7 +1208,7 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
$ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
2018-01-25
-- Run another round of tests on DSpace Test with jmeter after changing Tomcat's minSpareThreads to 20 (default is 10) and acceptorThreadCount to 2 (default is 1):
+- Run another round of tests on DSpace Test with jmeter after changing Tomcat’s minSpareThreads to 20 (default is 10) and acceptorThreadCount to 2 (default is 1):
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
@@ -1221,18 +1221,18 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
-- I haven't had time to look at the results yet
+- I haven’t had time to look at the results yet
2018-01-26
- Peter followed up about some of the points from the Skype meeting last week
-- Regarding the ORCID field issue, I see ICARDA's MELSpace is using cg.creator.ID: 0000-0001-9156-7691
+- Regarding the ORCID field issue, I see ICARDA’s MELSpace is using cg.creator.ID: 0000-0001-9156-7691
- I had floated the idea of using a controlled vocabulary with values formatted something like: Orth, Alan S. (0000-0002-1735-7458)
- Update PostgreSQL JDBC driver version from 42.1.4 to 42.2.1 on DSpace Test, see: https://jdbc.postgresql.org/
- Reboot DSpace Test to get new Linode kernel (Linux 4.14.14-x86_64-linode94)
- I am testing my old work on the dc.rights field, I had added a branch for it a few months ago
- I added a list of Creative Commons and other licenses in input-forms.xml
-- The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn't possible (?)
+- The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn’t possible (?)
- So I used some creativity and made several fields display values, but not store any, ie:
<pair>
@@ -1240,7 +1240,7 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
<stored-value></stored-value>
</pair>
-- I was worried that if a user selected this field for some reason that DSpace would store an empty value, but it simply doesn't register that as a valid option:
+- I was worried that if a user selected this field for some reason that DSpace would store an empty value, but it simply doesn’t register that as a valid option:

@@ -1286,9 +1286,9 @@ Was expecting one of:
Maximum: 2771268
Average: 210483
-- I guess responses that don't fit in RAM get saved to disk (a default of 1024M), so this is definitely not the issue here, and that warning is totally unrelated
-- My best guess is that the Solr search error is related somehow but I can't figure it out
-- We definitely have enough database connections, as I haven't seen a pool error in weeks:
+- I guess responses that don’t fit in RAM get saved to disk (a default of 1024M), so this is definitely not the issue here, and that warning is totally unrelated
+- My best guess is that the Solr search error is related somehow but I can’t figure it out
+- We definitely have enough database connections, as I haven’t seen a pool error in weeks:
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
dspace.log.2018-01-20:0
@@ -1305,7 +1305,7 @@ dspace.log.2018-01-29:0
Adam Hunt from WLE complained that pages take “1-2 minutes” to load each, from France and Sri Lanka
I asked him which particular pages, as right now pages load in 2 or 3 seconds for me
UptimeRobot said CGSpace went down again, and I looked at PostgreSQL and saw 211 active database connections
-If it's not memory and it's not database, it's gotta be Tomcat threads, seeing as the default maxThreads is 200 anyways, it actually makes sense
+If it’s not memory and it’s not database, it’s gotta be Tomcat threads, seeing as the default maxThreads is 200 anyways, it actually makes sense
I decided to change the Tomcat thread settings on CGSpace:
maxThreads from 200 (default) to 400
@@ -1333,8 +1333,8 @@ busy.value 0
idle.value 20
max.value 400
-- Apparently you can't monitor more than one connector, so I guess the most important to monitor would be the one that nginx is sending stuff to
-- So for now I think I'll just monitor these and skip trying to configure the jmx plugins
+- Apparently you can’t monitor more than one connector, so I guess the most important to monitor would be the one that nginx is sending stuff to
+- So for now I think I’ll just monitor these and skip trying to configure the jmx plugins
- Although following the logic of /usr/share/munin/plugins/jmx_tomcat_dbpools could be useful for getting the active Tomcat sessions
- From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:
@@ -1343,7 +1343,7 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"
- More notes here: https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx
- Looking at the Munin graphs, I see that the load is 200% every morning from 03:00 to almost 08:00
-- Tomcat's catalina.out log file is full of spam from this thing too, with lines like this
+- Tomcat’s catalina.out log file is full of spam from this thing too, with lines like this
[===================> ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
@@ -1359,7 +1359,7 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"
- UptimeRobot says CGSpace went down at 7:57 AM, and indeed I see a lot of HTTP 499 codes in nginx logs
- PostgreSQL activity shows 222 database connections
- Now PostgreSQL activity shows 265 database connections!
-- I don't see any errors anywhere…
+- I don’t see any errors anywhere…
- Now PostgreSQL activity shows 308 connections!
- Well this is interesting, there are 400 Tomcat threads busy:
@@ -1411,18 +1411,18 @@ javax.ws.rs.WebApplicationException
- We need to start graphing the Tomcat sessions as well, though that requires JMX
- Also, I wonder if I could disable the nightly Atmire thing
-- God, I don't know where this load is coming from
+- God, I don’t know where this load is coming from
- Since I bumped up the Tomcat threads from 200 to 400 the load on the server has been sustained at about 200% for almost a whole day:

- I should make separate database pools for the web applications and the API applications like REST and OAI
-- Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat's activeSessions from JMX (using munin-plugins-java):
+- Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat’s activeSessions from JMX (using munin-plugins-java):
# port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
Catalina:type=Manager,context=/,host=localhost activeSessions 8
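- In theory the same helper can read other MBeans too, for example the connector's thread pool (a sketch; the ThreadPool MBean name here is a guess based on standard Tomcat JMX naming, not something I verified):

```
# port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans 'Catalina:type=ThreadPool,name="http-bio-127.0.0.1-8443"' currentThreadsBusy
```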
-- If you connect to Tomcat in jvisualvm it's pretty obvious when you hover over the elements
+- If you connect to Tomcat in jvisualvm it’s pretty obvious when you hover over the elements

diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html
index 4adbe47d5..f39111521 100644
--- a/docs/2018-02/index.html
+++ b/docs/2018-02/index.html
@@ -9,9 +9,9 @@
@@ -23,11 +23,11 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plug
-
+
@@ -57,7 +57,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plug
-
+
@@ -104,7 +104,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plug
February, 2018
@@ -112,9 +112,9 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plug
2018-02-01
- Peter gave feedback on the dc.rights proof of concept that I had sent him last week
-- We don't need to distinguish between internal and external works, so that makes it just a simple list
+- We don’t need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
-- I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
+- I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

@@ -163,7 +163,7 @@ sys 0m1.905s
dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
-- I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn't go away
+- I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away
- This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
- Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:
@@ -200,10 +200,10 @@ Tue Feb 6 09:30:32 UTC 2018
295 197.210.168.174
752 144.76.64.79
-- I did notice in /var/log/tomcat7/catalina.out that Atmire's update thing was running though
+- I did notice in /var/log/tomcat7/catalina.out that Atmire’s update thing was running though
- So I restarted Tomcat and now everything is fine
- Next time I see that many database connections I need to save the output so I can analyze it later
-- I'm going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning
+- I’m going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning
- If I move the task from 3AM to 3PM, ideally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
- Eventually Atmire has said that there will be a fix for this high load caused by their script, but it will come with the 5.8 compatibility they are already working on
- I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
@@ -211,16 +211,16 @@ Tue Feb 6 09:30:32 UTC 2018
- I implemented some changes to the pooling in the Ansible infrastructure scripts so that each DSpace web application can use its own pool (web, api, and solr)
- Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down
- Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
-- I'm not actually sure if the Solr web application uses the database though, so I'll have to check later and remove it if necessary
+- I’m not actually sure if the Solr web application uses the database though, so I’ll have to check later and remove it if necessary
- I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week
2018-02-07
- Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace
-- I tried on DSpace Test and indeed the lookup just doesn't work!
+- I tried on DSpace Test and indeed the lookup just doesn’t work!
- The ORCiD code in DSpace appears to be using http://pub.orcid.org/, but when I go there in the browser it redirects me to https://pub.orcid.org/v2.0/
- According to the announcement the v1 API was moved from http://pub.orcid.org/ to https://pub.orcid.org/v1.2 until March 1st when it will be discontinued for good
-- But the old URL is hard coded in DSpace and it doesn't work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
+- But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
- So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
- CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
- I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:
@@ -249,7 +249,7 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
1828
- CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
-- What's interesting is that the DSpace log says the connections are all busy:
+- What’s interesting is that the DSpace log says the connections are all busy:
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
@@ -263,14 +263,14 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle
187
- What the fuck, does DSpace think all connections are busy?
-- I suspect these are issues with abandoned connections or maybe a leak, so I'm going to try adding the removeAbandoned='true' parameter which is apparently off by default
-- I will try testOnReturn='true' too, just to add more validation, because I'm fucking grasping at straws
+- I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the removeAbandoned='true' parameter which is apparently off by default
+- I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws
- Also, WTF, there was a heap space error randomly in catalina.out:
Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
-- I'm trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
+- I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
- Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
@@ -319,20 +319,20 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
992
-- Let's investigate who these IPs belong to:
+- Let’s investigate who these IPs belong to:
- 104.196.152.243 is CIAT, which is already marked as a bot via nginx!
-- 207.46.13.71 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
-- 40.77.167.62 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
-- 207.46.13.135 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
-- 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
-- 40.77.167.36 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
-- 207.46.13.54 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
-- 46.229.168.x is Semrush, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
+- 207.46.13.71 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
+- 40.77.167.62 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
+- 207.46.13.135 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
+- 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
+- 40.77.167.36 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
+- 207.46.13.54 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
+- 46.229.168.x is Semrush, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
-- Nice, so these are all known bots that are already crammed into one session by Tomcat's Crawler Session Manager Valve.
-- What in the actual fuck, why is our load doing this? It's gotta be something fucked up with the database pool being “busy” but everything is fucking idle
+- Nice, so these are all known bots that are already crammed into one session by Tomcat’s Crawler Session Manager Valve.
+- What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle
- One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:
BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
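- A quick way to check who an IP belongs to before blocking or rate limiting it (a sketch using standard tools):

```
$ whois 54.83.138.123 | grep -iE '^(orgname|org-name|netname)'
$ host 54.83.138.123
```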
@@ -343,7 +343,7 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
-- And they have 30 IPs, so fuck that shit I'm going to add them to the Tomcat Crawler Session Manager Valve nowwww
+- And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww
- Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace
- Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker
- This is how the connections looked when it crashed this afternoon:
@@ -359,16 +359,16 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
5 dspaceWeb
- So is this just some fucked up XMLUI database leaking?
-- I notice there is an issue (that I've probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
-- I seriously doubt this leaking shit is fixed for sure, but I'm gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I'm fed up with this shit
-- I cherry-picked all the commits for DS-3551 but it won't build on our current DSpace 5.5!
+- I notice there is an issue (that I’ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
+- I seriously doubt this leaking shit is fixed for sure, but I’m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I’m fed up with this shit
+- I cherry-picked all the commits for DS-3551 but it won’t build on our current DSpace 5.5!
- I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle
2018-02-10
- I tried to disable ORCID lookups but keep the existing authorities
- This item has an ORCID for Ralf Kiese: http://localhost:8080/handle/10568/89897
-- Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn't show up on the item
+- Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
- Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
@@ -377,7 +377,7 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
-- So I don't think we can disable the ORCID lookup function and keep the ORCID badges
+- So I don’t think we can disable the ORCID lookup function and keep the ORCID badges
2018-02-11
@@ -409,7 +409,7 @@ authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
-- That reminds me that Bizu had asked me to fix some of Alan Duncan's names in December
+- That reminds me that Bizu had asked me to fix some of Alan Duncan’s names in December
- I see he actually has some variations with “Duncan, Alan J.": https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=
- I will just update those for her too and then restart the indexing:
@@ -440,7 +440,7 @@ dspace=# commit;
I wrote a Python script (resolve-orcids-from-solr.py) using SolrClient to parse the Solr authority cache for ORCID IDs
We currently have 1562 authority records with ORCID IDs, and 624 unique IDs
We can use this to build a controlled vocabulary of ORCID IDs for new item submissions
-I don't know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?
+I don’t know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?
I added the script to the ILRI DSpace wiki on GitHub
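A rough way to double check those counts against the authority core (a sketch; the host, port, and orcid_id field name are assumptions based on the Solr queries elsewhere in these notes):

```
$ http 'http://localhost:8081/solr/authority/select?q=orcid_id:*&wt=json&rows=0' | grep numFound
```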
2018-02-12
@@ -448,21 +448,21 @@ dspace=# commit;
Follow up with Atmire on the DSpace 5.8 Compatibility ticket to ask again if they want me to send them a DSpace 5.8 branch to work on
Abenet asked if there was a way to get the number of submissions she and Bizuwork did
I said that the Atmire Workflow Statistics module was supposed to be able to do that
-We had tried it in June, 2017 and found that it didn't work
-Atmire sent us some fixes but they didn't work either
+We had tried it in June, 2017 and found that it didn’t work
+Atmire sent us some fixes but they didn’t work either
I just tried the branch with the fixes again and it indeed does not work:

-- I see that in April, 2017 I just used a SQL query to get a user's submissions by checking the dc.description.provenance field
+- I see that in April, 2017 I just used a SQL query to get a user’s submissions by checking the dc.description.provenance field
- So for Abenet, I can check her submissions in December, 2017 with:
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
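- Or just count them instead of listing every row (a sketch reusing the same query):

```
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
```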
- I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it
-- This would be using Linode's new block storage volumes
+- This would be using Linode’s new block storage volumes
- I think our current $40/month Linode has enough CPU and memory capacity, but we need more disk space
-- I think I'd probably just attach the block storage volume and mount it on /home/dspace
+- I think I’d probably just attach the block storage volume and mount it on /home/dspace
- Ask Peter about dc.rights on DSpace Test again, if he likes it then we should move it to CGSpace soon
2018-02-13
@@ -492,16 +492,16 @@ dspace.log.2018-02-11:3
dspace.log.2018-02-12:0
dspace.log.2018-02-13:4
-- I apparently added that on 2018-02-07 so it could be, as I don't see any of those socket closed errors in 2018-01's logs!
+- I apparently added that on 2018-02-07 so it could be, as I don’t see any of those socket closed errors in 2018-01’s logs!
- I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned
-- Peter hit this issue one more time, and this is apparently what Tomcat's catalina.out log says when an abandoned connection is removed:
+- Peter hit this issue one more time, and this is apparently what Tomcat’s catalina.out log says when an abandoned connection is removed:
Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
2018-02-14
- Skype with Peter and the Addis team to discuss what we need to do for the ORCIDs in the immediate future
-- We said we'd start with a controlled vocabulary for cg.creator.id on the DSpace Test submission form, where we store the author name and the ORCID in some format like: Alan S. Orth (0000-0002-1735-7458)
+- We said we’d start with a controlled vocabulary for cg.creator.id on the DSpace Test submission form, where we store the author name and the ORCID in some format like: Alan S. Orth (0000-0002-1735-7458)
- Eventually we need to find a way to print the author names with links to their ORCID profiles
- Abenet will send an email to the partners to give us ORCID IDs for their authors and to stress that they update their name format on ORCID.org if they want it in a special way
- I sent the Codeobia guys a question to ask how they prefer that we store the IDs, ie one of:
@@ -539,14 +539,14 @@ $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_c
$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1227
-- There are some formatting issues with names in Peter's list, so I should remember to re-generate the list of names from ORCID's API once we're done
+- There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done
- The dspace cleanup -v currently fails on CGSpace with the following:
- Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
-- The solution is to update the bitstream table, as I've discovered several other times in 2016 and 2017:
+- The solution is to update the bitstream table, as I’ve discovered several other times in 2016 and 2017:
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
UPDATE 1
@@ -561,7 +561,7 @@ UPDATE 1
- See the corresponding page on Altmetric: https://www.altmetric.com/details/handle/10568/78450
-And this item doesn't even exist on CGSpace!
+And this item doesn’t even exist on CGSpace!
Start working on XMLUI item display code for ORCIDs
Send emails to Macaroni Bros and Usman at CIFOR about ORCID metadata
CGSpace crashed while I was driving to Tel Aviv, and was down for four hours!
@@ -573,7 +573,7 @@ UPDATE 1
1 dspaceWeb
3 dspaceApi
# grep -c "Java heap space" /var/log/tomcat7/catalina.out
56
@@ -607,13 +607,13 @@ UPDATE 1
UPDATE 2
cg.creator.id field like “Alan Orth: 0000-0002-1735-7458” because no name will have a “:” so it’s easier to split on
-B1 I can see the line before the heap space error, which has the time, ie:
2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
# zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
168571
@@ -693,7 +693,7 @@ Traceback (most recent call last):
family_name = data['name']['family-name']['value']
TypeError: 'NoneType' object is not subscriptable
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name”
resolve-orcids.py:
$ cat orcid-test-values.txt
# valid identifier with 'given-names' and 'family-name'
@@ -753,13 +753,13 @@ TypeError: 'NoneType' object is not subscriptable
The Altmetric JavaScript builds the following API call: https://api.altmetric.com/v1/handle/10568/83320?callback=_altmetric.embed_callback&domain=cgspace.cgiar.org&key=3c130976ca2b8f2e88f8377633751ba1&cache_until=13-20
The response body is not JSON
To contrast, the following bare API call without query parameters is valid JSON: https://api.altmetric.com/v1/handle/10568/83320
-I told them that it's their JavaScript that is fucked up
+I told them that it’s their JavaScript that is fucked up
Remove CPWF project number and Humidtropics subject from submission form (#3)
I accidentally merged it into my own repository, oops
2018-02-22
-- CGSpace was apparently down today around 13:00 server time and I didn't get any emails on my phone, but saw them later on the computer
+- CGSpace was apparently down today around 13:00 server time and I didn’t get any emails on my phone, but saw them later on the computer
- It looks like Sisay restarted Tomcat because I was offline
- There was absolutely nothing interesting going on at 13:00 on the server, WTF?
@@ -789,7 +789,7 @@ TypeError: 'NoneType' object is not subscriptable
5208 5.9.6.51
8686 45.5.184.196
# grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
729
@@ -821,14 +821,14 @@ TypeError: 'NoneType' object is not subscriptable
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1004
-- I will add them to DSpace Test but Abenet says she's still waiting to send us ILRI's list
+- I will add them to DSpace Test but Abenet says she’s still waiting to send us ILRI’s list
- I will tell her that we should proceed on sharing our work on DSpace Test with the partners this week anyways and we can update the list later
- While regenerating the names for these ORCID identifiers I saw one that has a weird value for its names:
Looking up the names associated with ORCID iD: 0000-0002-2614-426X
Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
-- I don't know if the user accidentally entered this as their name or if that's how ORCID behaves when the name is private?
+- I don’t know if the user accidentally entered this as their name or if that’s how ORCID behaves when the name is private?
- I will remove that one from our list for now
- Remove Dryland Systems subject from submission form because that CRP closed two years ago (#355)
- Run all system updates on DSpace Test
@@ -842,7 +842,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
62464
(1 row)
orcid_id:*
id:d7ef744b-bbd4-4171-b449-00e37e1b776f, then I could query PostgreSQL for all metadata records using that authority:
removeAbandonedTimeout from 90 to something like 180 and continue observing
removeAbandoned for now because that’s the only thing I changed in the last few weeks since he started having issues
removeAbandoned thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:
removeAbandoned setting
pg_stat_activity for all queries running longer than 2 minutes:
dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
@@ -926,8 +926,8 @@ COPY 263
2018-02-28
-- CGSpace crashed today, the first HTTP 499 in nginx's access.log was around 09:12
-- There's nothing interesting going on in nginx's logs around that time:
+- CGSpace crashed today, the first HTTP 499 in nginx’s access.log was around 09:12
+- There’s nothing interesting going on in nginx’s logs around that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
65 197.210.168.174
@@ -995,8 +995,8 @@ dspace.log.2018-02-28:1
According to the log 01D9932D6E85E90C2BA9FF5563A76D03 is an ILRI editor, doing lots of updating and editing of items
8100883DAD00666A655AE8EC571C95AE is some Indian IP address
1E9834E918A550C5CD480076BC1B73A4 looks to be a session shared by the bots
-So maybe it was due to the editor's uploading of files, perhaps something that was too big or?
-I think I'll increase the JVM heap size on CGSpace from 6144m to 8192m because I'm sick of this random crashing shit and the server has memory and I'd rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
+So maybe it was due to the editor’s uploading of files, perhaps something that was too big or?
+I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit and the server has memory and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
Run the few corrections from earlier this month for sponsor on CGSpace:
cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html
index 7882881d0..8ee63ebc8 100644
--- a/docs/2018-03/index.html
+++ b/docs/2018-03/index.html
@@ -21,7 +21,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
-
+
@@ -51,7 +51,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
-
+
@@ -98,7 +98,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
March, 2018
@@ -143,7 +143,7 @@ UPDATE 1
- Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (#360)
- Help Sisay proof 200 IITA records on DSpace Test
-- Finally import Udana's 24 items to IWMI Journal Articles on CGSpace
+- Finally import Udana’s 24 items to IWMI Journal Articles on CGSpace
- Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc
2018-03-08
@@ -189,14 +189,14 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id
es
(9 rows)
-- On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that's probably why there are over 100,000 fields changed…
+- On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…
- If I skip that, there are about 2,000, which seems more reasonably like the amount of fields users have edited manually, or fucked up during CSV import, etc:
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
- I will apply this on CGSpace right now
-- In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine
+- In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine
- Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field
- For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
@@ -206,7 +206,7 @@ UPDATE 2309
if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
-- One thing that bothers me is that this won't honor author order
+- One thing that bothers me is that this won’t honor author order
- It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id
- I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching cg.creator.id fields: add-orcid-identifiers-csv.py
- The CSV should have two columns: author name and ORCID identifier:
@@ -215,13 +215,13 @@ UPDATE 2309
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
-- I didn't integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
-- I added ORCID identifiers for 187 items by CIAT's Hernan Ceballos, because that is what Elizabeth was trying to do manually!
+- I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
+- I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!
- Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well
2018-03-09
-- Give James Stapleton input on Sisay's KRAs
+- Give James Stapleton input on Sisay’s KRAs
- Create a pull request to disable ORCID authority integration for dc.contributor.author in the submission forms and XMLUI display (#363)
2018-03-11
@@ -240,12 +240,12 @@ g/jspui/listings-and-reports
org.apache.jasper.JasperException: java.lang.NullPointerException
…dc.identifier.citation):
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
$ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
dspace.log.2018-03-10:13
@@ -327,7 +327,7 @@ dspace.log.2018-03-17:13
dspace.log.2018-03-18:15
dspace.log.2018-03-19:90
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
@@ -341,7 +341,7 @@ dspace.log.2018-03-19:90
207 104.196.152.243
294 54.198.169.202
…catalina.out:
Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
@@ -354,7 +354,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOf
Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export
It appears to be this one: https://cgspace.cgiar.org/handle/10568/83473?show=full
The title is “Untitled” and there is some metadata but indeed the citation is missing
-I don't know what would cause that
+I don’t know what would cause that
2018-03-20
@@ -367,7 +367,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
@@ -406,7 +406,7 @@ java.lang.IllegalArgumentException: No choices plugin was configured for field
- Looks like the indexing gets confused that there is still data in the authority column
- Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!
-- Since we've migrated the ORCID identifiers associated with the authority data to the cg.creator.id field we can nullify the authorities remaining in the database:
+- Since we’ve migrated the ORCID identifiers associated with the authority data to the cg.creator.id field we can nullify the authorities remaining in the database:
dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
UPDATE 195463
@@ -417,8 +417,8 @@ java.lang.IllegalArgumentException: No choices plugin was configured for field
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
COPY 56156
-- Afterwards we'll want to do some batch tagging of ORCID identifiers to these names
-- CGSpace crashed again this afternoon, I'm not sure of the cause but there are a lot of SQL errors in the DSpace log:
+- Afterwards we’ll want to do some batch tagging of ORCID identifiers to these names
+- CGSpace crashed again this afternoon, I’m not sure of the cause but there are a lot of SQL errors in the DSpace log:
2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection has already been closed.
@@ -444,11 +444,11 @@ java.lang.OutOfMemoryError: Java heap space
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
319
-- I guess we need to give it more RAM because it now has CGSpace's large Solr core
+- I guess we need to give it more RAM because it now has CGSpace’s large Solr core
- I will increase the memory from 3072m to 4096m
- Update Ansible playbooks to use PostgreSQL JBDC driver 42.2.2
- Deploy the new JDBC driver on DSpace Test
-- I'm also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode's new block storage volumes
+- I’m also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode’s new block storage volumes
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
@@ -456,9 +456,9 @@ real 208m19.155s
user 8m39.138s
sys 2m45.135s
-- So that's about three times as long as it took on CGSpace this morning
+- So that’s about three times as long as it took on CGSpace this morning
- I should also check the raw read speed with
hdparm -tT /dev/sdc
-- Looking at Peter's author corrections there are some mistakes due to Windows 1252 encoding
+- Looking at Peter’s author corrections there are some mistakes due to Windows 1252 encoding
- I need to find a way to filter these easily with OpenRefine
- For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields
- I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:
@@ -475,16 +475,16 @@ sys 2m45.135s
2018-03-24
- More work on the Ubuntu 18.04 readiness stuff for the Ansible playbooks
-- The playbook now uses the system's Ruby and Node.js so I don't have to manually install RVM and NVM after
+- The playbook now uses the system’s Ruby and Node.js so I don’t have to manually install RVM and NVM after
2018-03-25
-- Looking at Peter's author corrections and trying to work out a way to find errors in OpenRefine easily
+- Looking at Peter’s author corrections and trying to work out a way to find errors in OpenRefine easily
- I can find all names that have acceptable characters using a GREL expression like:
isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
-- But it's probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):
+- But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):
or(
isNotNull(value.match(/.*[(|)].*/)),
@@ -493,7 +493,7 @@ sys 2m45.135s
isNotNull(value.match(/.*\u200A.*/))
)
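The same check is easy to run outside OpenRefine; a rough Python sketch over a metadata CSV export (file and column names are hypothetical):

```
# Sketch: flag CSV cells containing characters I consider invalid in author
# names (parentheses, pipe, replacement character, hair space). The file and
# column names are hypothetical.
import csv
import re

SUSPECT = re.compile(r"[(|)\ufffd\u200a]")

with open("authors.csv", newline="", encoding="utf-8") as f:
    # start=2 because row 1 of the file is the header
    for line_number, row in enumerate(csv.DictReader(f), start=2):
        value = row["dc.contributor.author"]
        if SUSPECT.search(value):
            print(f"line {line_number}: {value!r}")
```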
-- And here's one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it's time to add delete support to my fix-metadata-values.py script:
+- And here’s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my fix-metadata-values.py script:
or(
isNotNull(value.match(/.*delete.*/i)),
@@ -523,21 +523,21 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
2018-03-26
-- Atmire got back to me about the Listings and Reports issue and said it's caused by items that have missing dc.identifier.citation fields
+- Atmire got back to me about the Listings and Reports issue and said it’s caused by items that have missing dc.identifier.citation fields
- They will send a fix
2018-03-27
-- Atmire got back with an updated quote about the DSpace 5.8 compatibility so I've forwarded it to Peter
+- Atmire got back with an updated quote about the DSpace 5.8 compatibility so I’ve forwarded it to Peter
2018-03-28
-- DSpace Test crashed due to heap space so I've increased it from 4096m to 5120m
-- The error in Tomcat's catalina.out was:
+- DSpace Test crashed due to heap space so I’ve increased it from 4096m to 5120m
+- The error in Tomcat’s catalina.out was:
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
-- Add ISI Journal (cg.isijournal) as an option in Atmire's Listing and Reports layout (#370) for Abenet
+- Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (#370) for Abenet
- I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:
$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
@@ -552,7 +552,7 @@ Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
-- That's weird because we just updated them last week…
+- That’s weird because we just updated them last week…
- Create a pull request to enable searching by ORCID identifier (cg.creator.id) in Discovery and Listings and Reports (#371)
- I will test it on DSpace Test first!
- Fix one missing XMLUI string for “Access Status” (cg.identifier.status)
diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html
index 406261bc8..cc6f5f948 100644
--- a/docs/2018-04/index.html
+++ b/docs/2018-04/index.html
@@ -8,7 +8,7 @@
@@ -20,10 +20,10 @@ Catalina logs at least show some memory errors yesterday:
-
+
@@ -53,7 +53,7 @@ Catalina logs at least show some memory errors yesterday:
-
+
@@ -100,14 +100,14 @@ Catalina logs at least show some memory errors yesterday:
April, 2018
2018-04-01
-- I tried to test something on DSpace Test but noticed that it's down since god knows when
+- I tried to test something on DSpace Test but noticed that it’s down since god knows when
- Catalina logs at least show some memory errors yesterday:
Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
@@ -124,7 +124,7 @@ Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]
2018-04-04
-- Peter noticed that there were still some old CRP names on CGSpace, because I hadn't forced the Discovery index to be updated after I fixed the others last week
+- Peter noticed that there were still some old CRP names on CGSpace, because I hadn’t forced the Discovery index to be updated after I fixed the others last week
- For completeness I re-ran the CRP corrections on CGSpace:
$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
@@ -139,7 +139,7 @@ real 76m13.841s
user 8m22.960s
sys 2m2.498s
-- Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme's items
+- Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items
- I used my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
@@ -165,13 +165,13 @@ $ git rebase -i dspace-5.8
DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)
-… but somehow git knew, and didn't include them in my interactive rebase!
+… but somehow git knew, and didn’t include them in my interactive rebase!
I need to send this branch to Atmire and also arrange payment (see ticket #560 in their tracker)
-Fix Sisay's SSH access to the new DSpace Test server (linode19)
+Fix Sisay’s SSH access to the new DSpace Test server (linode19)
2018-04-05
-- Fix Sisay's sudo access on the new DSpace Test server (linode19)
+- Fix Sisay’s sudo access on the new DSpace Test server (linode19)
- The reindexing process on DSpace Test took forever yesterday:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
@@ -220,15 +220,15 @@ sys 2m52.585s
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
-- 70.32.83.92 appears to be some harvester we've seen before, but on a new IP
+- 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP
- They are not creating new Tomcat sessions so there is no problem there
- 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
-- I'm not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
-- Let's try a manual request with and without their user agent:
+- I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
+- Let’s try a manual request with and without their user agent:
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
@@ -312,7 +312,7 @@ UPDATE 1
2115
- Apparently from these stacktraces we should be able to see which code is not closing connections properly
-- Here's a pretty good overview of days where we had database issues recently:
+- Here’s a pretty good overview of days where we had database issues recently:
# zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
1 Feb 18, 2018
@@ -337,9 +337,9 @@ UPDATE 1
In Tomcat 8.5 the removeAbandoned property has been split into two: removeAbandonedOnBorrow and removeAbandonedOnMaintenance
See: https://tomcat.apache.org/tomcat-8.5-doc/jndi-datasource-examples-howto.html#Database_Connection_Pool_(DBCP_2)_Configurations
I assume we want removeAbandonedOnBorrow and make updates to the Tomcat 8 templates in Ansible
-After reading more documentation I see that Tomcat 8.5's default DBCP seems to now be Commons DBCP2 instead of Tomcat DBCP
-It can be overridden in Tomcat's server.xml by setting factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in the <Resource>
-I think we should use this default, so we'll need to remove some other settings that are specific to Tomcat's DBCP like jdbcInterceptors and abandonWhenPercentageFull
+After reading more documentation I see that Tomcat 8.5’s default DBCP seems to now be Commons DBCP2 instead of Tomcat DBCP
+It can be overridden in Tomcat’s server.xml by setting factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in the <Resource>
+I think we should use this default, so we’ll need to remove some other settings that are specific to Tomcat’s DBCP like jdbcInterceptors and abandonWhenPercentageFull
Merge the changes adding ORCID identifier to advanced search and Atmire Listings and Reports (#371)
Fix one more issue of missing XMLUI strings (for CRP subject when clicking “view more” in the Discovery sidebar)
I told Udana to fix the citation and abstract of the one item, and to correct the dc.language.iso for the five Spanish items in his Book Chapters collection
@@ -377,7 +377,7 @@ java.lang.NullPointerException
I see the same error on DSpace Test so this is definitely a problem
After disabling the authority consumer I no longer see the error
I merged a pull request to the 5_x-prod branch to clean that up (#372)
-File a ticket on DSpace's Jira for the target="_blank" security and performance issue (DS-3891)
+File a ticket on DSpace’s Jira for the target="_blank" security and performance issue (DS-3891)
I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:
BUILD SUCCESSFUL
@@ -394,7 +394,7 @@ Total time: 4 minutes 12 seconds
- IWMI people are asking about building a search query that outputs RSS for their reports
- They want the same results as this Discovery query: https://cgspace.cgiar.org/discover?filtertype_1=dateAccessioned&filter_relational_operator_1=contains&filter_1=2018&submit_apply_filter=&query=&scope=10568%2F16814&rpp=100&sort_by=dc.date.issued_dt&order=desc
-- They will need to use OpenSearch, but I can't remember all the parameters
+- They will need to use OpenSearch, but I can’t remember all the parameters
- Apparently search sort options for OpenSearch are in dspace.cfg:
webui.itemlist.sort-option.1 = title:dc.title:title
@@ -410,15 +410,15 @@ webui.itemlist.sort-option.4 = type:dc.type:text
For example, set rpp=1 and then check the results for start values of 0, 1, and 2 and they are all the same!
If I have time I will check if this behavior persists on DSpace 6.x on the official DSpace demo and file a bug
Also, the DSpace Manual as of 5.x has very poor documentation for OpenSearch
-They don't tell you to use Discovery search filters in the query (with format query=dateIssued:2018)
-They don't tell you that the sort options are actually defined in dspace.cfg (ie, you need to use 2 instead of dc.date.issued_dt)
+They don’t tell you to use Discovery search filters in the query (with format query=dateIssued:2018)
+They don’t tell you that the sort options are actually defined in dspace.cfg (ie, you need to use 2 instead of dc.date.issued_dt)
They are missing the order parameter (ASC vs DESC)
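To make those parameters concrete, here is a rough sketch of building such an OpenSearch request in Python; the /open-search/discover path and the exact parameter set are my assumptions from DSpace 5 defaults, so treat it as an illustration rather than a verified recipe:

```
# Sketch: request an RSS feed of 2018 items in the IWMI community via the
# DSpace OpenSearch endpoint. The path and parameter names are assumptions
# from DSpace 5 defaults (sort_by is the numeric option from dspace.cfg,
# order is ASC/DESC).
import requests

params = {
    "query": "dateIssued:2018",
    "scope": "10568/16814",   # IWMI community handle from the Discovery query above
    "rpp": 100,               # results per page
    "start": 0,
    "sort_by": 2,             # numeric sort option instead of dc.date.issued_dt
    "order": "DESC",
    "format": "rss",
}

response = requests.get("https://cgspace.cgiar.org/open-search/discover", params=params)
response.raise_for_status()
print(response.text[:500])
```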
I notice that DSpace Test has crashed again, due to memory:
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
178
-- I will increase the JVM heap size from 5120M to 6144M, though we don't have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
+- I will increase the JVM heap size from 5120M to 6144M, though we don’t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
- Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats
- I got a list of all the CIP collections manually and use the same query that I used in August, 2017:
@@ -445,8 +445,8 @@ sys 2m2.687s
2018-04-20
-- Gabriela from CIP emailed to say that CGSpace was returning a white page, but I haven't seen any emails from UptimeRobot
-- I confirm that it's just giving a white page around 4:16
+- Gabriela from CIP emailed to say that CGSpace was returning a white page, but I haven’t seen any emails from UptimeRobot
+- I confirm that it’s just giving a white page around 4:16
- The DSpace logs show that there are no database connections:
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
@@ -456,7 +456,7 @@ sys 2m2.687s
# grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
32147
-- I can't even log into PostgreSQL as the postgres user, WTF?
+- I can’t even log into PostgreSQL as the postgres user, WTF?
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
^C
@@ -475,7 +475,7 @@ sys 2m2.687s
4325 70.32.83.92
10718 45.5.184.2
-- It doesn't even seem like there is a lot of traffic compared to the previous days:
+- It doesn’t even seem like there is a lot of traffic compared to the previous days:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
74931
@@ -485,9 +485,9 @@ sys 2m2.687s
93459
- I tried to restart Tomcat but systemctl hangs
-- I tried to reboot the server from the command line but after a few minutes it didn't come back up
+- I tried to reboot the server from the command line but after a few minutes it didn’t come back up
- Looking at the Linode console I see that it is stuck trying to shut down
-- Even “Reboot” via Linode console doesn't work!
+- Even “Reboot” via Linode console doesn’t work!
- After shutting it down a few times via the Linode console it finally rebooted
- Everything is back but I have no idea what caused this—I suspect something with the hosting provider
- Also super weird, the last entry in the DSpace log file is from 2018-04-20 16:35:09, and then immediately it goes to 2018-04-20 19:15:04 (three hours later!):
@@ -518,13 +518,13 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
2018-04-24
-- Testing my Ansible playbooks with a clean and updated installation of Ubuntu 18.04 and I fixed some issues that I hadn't run into a few weeks ago
+- Testing my Ansible playbooks with a clean and updated installation of Ubuntu 18.04 and I fixed some issues that I hadn’t run into a few weeks ago
- There seems to be a new issue with Java dependencies, though
- The default-jre package is going to be Java 10 on Ubuntu 18.04, but I want to use openjdk-8-jre-headless (well, the JDK actually, but it uses this JRE)
- Tomcat and Ant are fine with Java 8, but the maven package wants to pull in Java 10 for some reason
- Looking closer, I see that maven depends on java7-runtime-headless, which is indeed provided by openjdk-8-jre-headless
-- So it must be one of Maven's dependencies…
-- I will watch it for a few days because it could be an issue that will be resolved before Ubuntu 18.04's release
+- So it must be one of Maven’s dependencies…
+- I will watch it for a few days because it could be an issue that will be resolved before Ubuntu 18.04’s release
- Otherwise I will post a bug to the ubuntu-release mailing list
- Looks like the only way to fix this is to install openjdk-8-jdk-headless before (so it pulls in the JRE) in a separate transaction, or to manually install openjdk-8-jre-headless in the same apt transaction as maven
- Also, I started porting PostgreSQL 9.6 into the Ansible infrastructure scripts
@@ -534,12 +534,12 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
- Still testing the Ansible infrastructure playbooks for Ubuntu 18.04, Tomcat 8.5, and PostgreSQL 9.6
- One other new thing I notice is that PostgreSQL 9.6 no longer uses createuser and nocreateuser, as those have actually meant superuser and nosuperuser and have been deprecated for ten years
-- So for my notes, when I'm importing a CGSpace database dump I need to amend my notes to give super user permission to a user, rather than create user:
+- So for my notes, when I’m importing a CGSpace database dump I need to amend my notes to give super user permission to a user, rather than create user:
$ psql dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
-- There's another issue with Tomcat in Ubuntu 18.04:
+- There’s another issue with Tomcat in Ubuntu 18.04:
25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
@@ -554,13 +554,13 @@ $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
-- There's a Debian bug about this from a few weeks ago
-- Apparently Tomcat was compiled with Java 9, so doesn't work with Java 8
+- There’s a Debian bug about this from a few weeks ago
+- Apparently Tomcat was compiled with Java 9, so doesn’t work with Java 8
2018-04-29
- DSpace Test crashed again, looks like memory issues again
-- JVM heap size was last increased to 6144m but the system only has 8GB total so there's not much we can do here other than get a bigger Linode instance or remove the massive Solr Statistics data
+- JVM heap size was last increased to 6144m but the system only has 8GB total so there’s not much we can do here other than get a bigger Linode instance or remove the massive Solr Statistics data
2018-04-30
diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
index 8436ba410..f6f2e7a0f 100644
--- a/docs/2018-05/index.html
+++ b/docs/2018-05/index.html
@@ -35,7 +35,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
-
+
@@ -65,7 +65,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
-
+
@@ -112,7 +112,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
May, 2018
@@ -135,7 +135,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
- Looking over some IITA records for Sisay
- Other than trimming and collapsing consecutive whitespace, I made some other corrections
-- I need to check the correct formatting of COTE D'IVOIRE vs COTE D’IVOIRE
+- I need to check the correct formatting of COTE D’IVOIRE vs COTE D’IVOIRE
- I replaced all DOIs with HTTPS
- I checked a few DOIs and found at least one that was missing, so I Googled the title of the paper and found the correct DOI
- Also, I found an FAQ for DOI that says the dx.doi.org syntax is older, so I will replace all the DOIs with doi.org instead
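That normalization is trivial to express as code; a tiny Python sketch of the rule just described (the example DOI is a placeholder):

```
# Sketch: normalize DOI links to HTTPS and doi.org instead of dx.doi.org.
import re

def normalize_doi(url):
    url = re.sub(r"^http://", "https://", url)
    return url.replace("https://dx.doi.org/", "https://doi.org/")

# placeholder DOI, not one of the records mentioned here
print(normalize_doi("http://dx.doi.org/10.5555/12345678"))
```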
@@ -180,7 +180,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
$ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
-- Most of the links are good, though one is duplicate and one seems to even be incorrect in the publisher's site so…
+- Most of the links are good, though one is duplicate and one seems to even be incorrect in the publisher’s site so…
- Also, there are some duplicates:
10568/92241 and 10568/92230 (same DOI)
@@ -216,8 +216,8 @@ $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combine
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
-- I made a pull request (#373) for this that I'll merge some time next week (I'm expecting Atmire to get back to us about DSpace 5.8 soon)
-- After testing quickly I just decided to merge it, and I noticed that I don't even need to restart Tomcat for the changes to get loaded
+- I made a pull request (#373) for this that I’ll merge some time next week (I’m expecting Atmire to get back to us about DSpace 5.8 soon)
+- After testing quickly I just decided to merge it, and I noticed that I don’t even need to restart Tomcat for the changes to get loaded
2018-05-07
@@ -225,7 +225,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
- The documentation regarding the Solr stuff is limited, and I cannot figure out what all the fields in conciliator.properties are supposed to be
- But then I found reconcile-csv, which allows you to reconcile against values in a CSV file!
- That, combined with splitting our multi-value fields on “||” in OpenRefine is amaaaaazing, because after reconciliation you can just join them again
-- Oh wow, you can also facet on the individual values once you've split them! That's going to be amazing for proofing CRPs, subjects, etc.
+- Oh wow, you can also facet on the individual values once you’ve split them! That’s going to be amazing for proofing CRPs, subjects, etc.
2018-05-09
@@ -276,7 +276,7 @@ Livestock and Fish
- It turns out there was a space in my “country” header that was causing reconcile-csv to crash
- After removing that it works fine!
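A quick sanity check for that kind of problem is to look at the CSV headers for stray whitespace before handing the file to reconcile-csv; a small Python sketch (assuming the same country CSV used elsewhere in these notes):

```
# Sketch: warn about leading/trailing whitespace in CSV header names, which
# is exactly what made reconcile-csv crash here.
import csv

with open("2018-05-10-countries.csv", newline="") as f:
    headers = next(csv.reader(f))

for header in headers:
    if header != header.strip():
        print(f"Header {header!r} has leading/trailing whitespace")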
-- Looking at Sisay's 2,640 CIFOR records on DSpace Test (10568/92904)
+- Looking at Sisay’s 2,640 CIFOR records on DSpace Test (10568/92904)
- Trimmed all leading / trailing white space and condensed multiple spaces into one
- Corrected DOIs to use HTTPS and “doi.org” instead of “dx.doi.org”
@@ -318,9 +318,9 @@ return "blank"
- You could use this in a facet or in a new column
- More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine
- Finish looking at the 2,640 CIFOR records on DSpace Test (10568/92904), cleaning up authors and adding collection mappings
-- They can now be moved to CGSpace as far as I'm concerned, but I don't know if Sisay will do it or me
-- I was checking the CIFOR data for duplicates using Atmire's Metadata Quality Module (and found some duplicates actually), but then DSpace died…
-- I didn't see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:
+- They can now be moved to CGSpace as far as I’m concerned, but I don’t know if Sisay will do it or me
+- I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…
+- I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:
[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
[Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
@@ -335,7 +335,7 @@ return "blank"
2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
-- So I'm not sure…
+- So I’m not sure…
- I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:
- The trick was to use a more appropriate Solr fieldType text_en instead of text_general so that more terms match, for example uppercase and lower case:
@@ -344,11 +344,11 @@ $ ./bin/solr create_core -c countries
$ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
-- It still doesn't catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn't return scores, so I have to select matches manually:
+- It still doesn’t catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn’t return scores, so I have to select matches manually:

-- I should probably make a general copy field and set it to be the default search field, like DSpace's search core does (see schema.xml):
+- I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):
<defaultSearchField>search_text</defaultSearchField>
...
@@ -356,7 +356,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
- Actually, I wonder how much of their schema I could just copy…
- Apparently the default search field is the df parameter and you could technically just add it to the query string, so no need to bother with that in the schema now
-- I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn't seem to be any better at matching than the text_en type
+- I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn’t seem to be any better at matching than the text_en type
- I think I need to focus on trying to return scores with conciliator
2018-05-16
@@ -364,9 +364,9 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
Discuss GDPR with James Stapleton
- As far as I see it, we are “Data Controllers” on CGSpace because we store peoples’ names, emails, and phone numbers if they register
-- We set cookies on the user's computer, but these do not contain personally identifiable information (PII) and they are “session” cookies which are deleted when the user closes their browser
+- We set cookies on the user’s computer, but these do not contain personally identifiable information (PII) and they are “session” cookies which are deleted when the user closes their browser
- We use Google Analytics to track website usage, which makes Google the “Data Processor” and in this case we merely need to limit or obfuscate the information we send to them
-- As the only personally identifiable information we send is the user's IP address, I think we only need to enable IP Address Anonymization in our analytics.js code snippets
+- As the only personally identifiable information we send is the user’s IP address, I think we only need to enable IP Address Anonymization in our analytics.js code snippets
- Then we can add a “Privacy” page to CGSpace that makes all of this clear
@@ -380,22 +380,22 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
- I tested loading a certain page before and after adding this and afterwards I saw that the parameter aip=1 was being sent with the analytics response to Google
- According to the analytics.js protocol parameter documentation this means that IPs are being anonymized
-- After finding and fixing some duplicates in IITA's IITA_April_27 test collection on DSpace Test (10568/92703) I told Sisay that he can move them to IITA's Journal Articles collection on CGSpace
+- After finding and fixing some duplicates in IITA’s IITA_April_27 test collection on DSpace Test (10568/92703) I told Sisay that he can move them to IITA’s Journal Articles collection on CGSpace
2018-05-17
-- Testing reconciliation of countries against Solr via conciliator, I notice that CÔTE D'IVOIRE doesn't match COTE D'IVOIRE, whereas with reconcile-csv it does
-- Also, when reconciling regions against Solr via conciliator EASTERN AFRICA doesn't match EAST AFRICA, whereas with reconcile-csv it does
+- Testing reconciliation of countries against Solr via conciliator, I notice that CÔTE D'IVOIRE doesn’t match COTE D'IVOIRE, whereas with reconcile-csv it does
+- Also, when reconciling regions against Solr via conciliator EASTERN AFRICA doesn’t match EAST AFRICA, whereas with reconcile-csv it does
- And SOUTH AMERICA matches both SOUTH ASIA and SOUTH AMERICA with the same match score of 2… WTF.
- It could be that I just need to tune the query filter in Solr (currently using the example text_en field type)
- Oh sweet, it turns out that the issue with searching for characters with accents is called “code folding” in Solr
- You can use either a solr.ASCIIFoldingFilterFactory filter or a solr.MappingCharFilterFactory charFilter mapping against mapping-FoldToASCII.txt
- Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/
- Now CÔTE D'IVOIRE matches COTE D'IVOIRE!
-- I'm not sure which method is better, perhaps the solr.ASCIIFoldingFilterFactory filter because it doesn't require copying the mapping-FoldToASCII.txt file
-- And actually I'm not entirely sure about the order of filtering before tokenizing, etc…
+- I’m not sure which method is better, perhaps the solr.ASCIIFoldingFilterFactory filter because it doesn’t require copying the mapping-FoldToASCII.txt file
+- And actually I’m not entirely sure about the order of filtering before tokenizing, etc…
- Ah, I see that charFilter must be before the tokenizer because it works on a stream, whereas filter operates on tokenized input so it must come after the tokenizer
-- Regarding the use of the charFilter vs the filter class before and after the tokenizer, respectively, I think it's better to use the charFilter to normalize the input stream before tokenizing it as I have no idea what kinda stuff might get removed by the tokenizer
+- Regarding the use of the charFilter vs the filter class before and after the tokenizer, respectively, I think it’s better to use the charFilter to normalize the input stream before tokenizing it as I have no idea what kinda stuff might get removed by the tokenizer
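Outside of Solr, the effect of ASCII folding is easy to demonstrate with Python's unicodedata, which is a decent mental model for why the accented and unaccented forms now match (this is an illustration, not Solr's ASCIIFoldingFilterFactory):

```
# Sketch: approximate what ASCII folding does to accented characters, so
# "CÔTE D'IVOIRE" and "COTE D'IVOIRE" compare equal after folding.
# Illustration only, not the Solr filter itself.
import unicodedata

def fold_to_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(fold_to_ascii("CÔTE D'IVOIRE") == "COTE D'IVOIRE")  # True
```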
- Skype with Geoffrey from IITA in Nairobi who wants to deposit records to CGSpace via the REST API but I told him that this skips the submission workflows and because we cannot guarantee the data quality we would not allow anyone to use it this way
- I finished making the XMLUI changes for anonymization of IP addresses in Google Analytics and merged the changes to the 5_x-prod branch (#375)
- Also, I think we might be able to implement opt-out functionality for Google Analytics using a window property that could be managed by storing its status in a cookie
@@ -430,7 +430,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
2018-05-23
-- I'm investigating how many non-CGIAR users we have registered on CGSpace:
+- I’m investigating how many non-CGIAR users we have registered on CGSpace:
dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
@@ -443,13 +443,13 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
2018-05-28
- Daniel Haile-Michael sent a message that CGSpace was down (I am currently in Oregon so the time difference is ~10 hours)
-- I looked in the logs but didn't see anything that would be the cause of the crash
+- I looked in the logs but didn’t see anything that would be the cause of the crash
- Atmire finalized the DSpace 5.8 testing and sent a pull request: https://github.com/ilri/DSpace/pull/378
- They have asked if I can test this and get back to them by June 11th
2018-05-30
-- Talk to Samantha from Bioversity about something related to Google Analytics, I'm still not sure what they want
+- Talk to Samantha from Bioversity about something related to Google Analytics, I’m still not sure what they want
- DSpace Test crashed last night, seems to be related to system memory (not JVM heap)
- I see this in dmesg:
@@ -458,7 +458,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
[Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- I need to check the Tomcat JVM heap size/usage, command line JVM heap size (for cron jobs), and PostgreSQL memory usage
-- It might be possible to adjust some things, but eventually we'll need a larger VPS instance
+- It might be possible to adjust some things, but eventually we’ll need a larger VPS instance
- For some reason there are no JVM stats in Munin, ugh
- Run all system updates on DSpace Test and reboot it
- I generated a list of CIFOR duplicates from the CIFOR_May_9 collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika
@@ -467,13 +467,13 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
$ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
-- I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR's collection
+- I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection
- A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections
-- I can use the /communities/{id}/collections endpoint of the REST API but it only takes IDs (not handles) and doesn't seem to descend into sub communities
+- I can use the /communities/{id}/collections endpoint of the REST API but it only takes IDs (not handles) and doesn’t seem to descend into sub communities
- Shit, so I need the IDs for the top-level ILRI community and all its sub communities (and their sub communities)
- There has got to be a better way to do this than going to each community and getting their handles and IDs manually
- Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: rest-find-collections.py
-- The output isn't great, but all the handles and IDs are printed in debug mode:
+- The output isn’t great, but all the handles and IDs are printed in debug mode:
$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
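For context, the recursion such a script needs looks roughly like the following Python sketch; the endpoint layout (/handle/{handle}, /communities/{id}/communities, /communities/{id}/collections) is my reading of the DSpace 5 REST API, and this is not the actual rest-find-collections.py:

```
# Sketch: walk a community hierarchy via the DSpace 5 REST API and print each
# collection's id, handle, and name. Endpoint layout is an assumption from the
# DSpace 5 REST API; not the actual rest-find-collections.py.
import requests

BASE = "https://cgspace.cgiar.org/rest"

def walk_community(community_id):
    for collection in requests.get(f"{BASE}/communities/{community_id}/collections").json():
        print(collection["id"], collection["handle"], collection["name"])
    for sub in requests.get(f"{BASE}/communities/{community_id}/communities").json():
        walk_community(sub["id"])

# 10568/1 is the top-level community handle used above; the REST API wants the
# internal id, which we can look up from the handle first.
top = requests.get(f"{BASE}/handle/10568/1").json()
walk_community(top["id"])
```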
@@ -482,8 +482,8 @@ $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cle
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
2018-05-31
-- Clarify CGSpace's usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance
-- Testing running PostgreSQL in a Docker container on localhost because when I'm on Arch Linux there isn't an easily installable package for particular PostgreSQL versions
+- Clarify CGSpace’s usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance
+- Testing running PostgreSQL in a Docker container on localhost because when I’m on Arch Linux there isn’t an easily installable package for particular PostgreSQL versions
- Now I can just use Docker:
$ docker pull postgres:9.5-alpine
diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
index e094313c1..5a894697f 100644
--- a/docs/2018-06/index.html
+++ b/docs/2018-06/index.html
@@ -10,7 +10,7 @@
Test the DSpace 5.8 module upgrades from Atmire (#378)
-There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
+There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
@@ -38,7 +38,7 @@ sys 2m7.289s
Test the DSpace 5.8 module upgrades from Atmire (#378)
-There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
+There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
@@ -55,7 +55,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
-
+
@@ -85,7 +85,7 @@ sys 2m7.289s
-
+
@@ -132,7 +132,7 @@ sys 2m7.289s
June, 2018
@@ -141,7 +141,7 @@ sys 2m7.289s
- Test the DSpace 5.8 module upgrades from Atmire (#378)
-- There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
+- There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
- I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
@@ -160,8 +160,8 @@ sys 2m7.289s
2018-06-06
- It turns out that I needed to add a server block for atmire.com-snapshots to my Maven settings, so now the Atmire code builds
-- Now Maven and Ant run properly, but I'm getting SQL migration errors in dspace.log after starting Tomcat
-- I've updated my ticket on Atmire's bug tracker: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
+- Now Maven and Ant run properly, but I’m getting SQL migration errors in dspace.log after starting Tomcat
+- I’ve updated my ticket on Atmire’s bug tracker: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
2018-06-07
@@ -204,7 +204,7 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
2018-06-09
-- It's pretty annoying, but the JVM monitoring for Munin was never set up when I migrated DSpace Test to its new server a few months ago
+- It’s pretty annoying, but the JVM monitoring for Munin was never set up when I migrated DSpace Test to its new server a few months ago
- I ran the tomcat and munin-node tags in Ansible again and now the stuff is all wired up and recording stats properly
- I applied the CIP author corrections on CGSpace and DSpace Test and re-ran the Discovery indexing
@@ -216,9 +216,9 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
-- I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I'm not actually sure if that is related to MQM or not
+- I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I’m not actually sure if that is related to MQM or not
- I will have to ask Atmire
-- I continued to look at Sisay's IITA records from last week
+- I continued to look at Sisay’s IITA records from last week
- I normalized all DOIs to use HTTPS and “doi.org” instead of “dx.doi.org”
- I cleaned up white space in cg.subject.iita and dc.subject
@@ -254,14 +254,14 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
- “Institut de la Recherche Agronomique, Cameroon” and “Institut de Recherche Agronomique, Cameroon”
-- Inconsistency in countries: “COTE D’IVOIRE” and “COTE D'IVOIRE”
+- Inconsistency in countries: “COTE D’IVOIRE” and “COTE D’IVOIRE”
- A few DOIs with spaces or invalid characters
- Inconsistency in IITA subjects, for example “PRODUCTION VEGETALE” and “PRODUCTION VÉGÉTALE” and several others
- I ran value.unescape('javascript') on the abstract and citation fields because it looks like this data came from a SQL database and some stuff was escaped
-It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
-So I told Sisay to re-create the collection using Abenet's XLS from last week (Mercy1805_AY.xls)
+It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede’s original file it doesn’t have all those corrections
+So I told Sisay to re-create the collection using Abenet’s XLS from last week (Mercy1805_AY.xls)
I was curious to see if I could create a GREL for use with a custom text facet in Open Refine to find cells with two or more consecutive spaces
I always use the built-in trim and collapse transformations anyways, but this seems to work to find the offending cells: isNotNull(value.match(/.*?\s{2,}.*?/))
I wonder if I should start checking for “smart” quotes like ’ (hex 2019)
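Both of those checks are also easy to run over a CSV export in Python; a rough sketch with hypothetical file and column names:

```
# Sketch: flag values with two or more consecutive spaces or a "smart" quote
# (U+2019); file and column names are hypothetical.
import csv
import re

with open("iita-records.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        value = row["dc.title"]
        if re.search(r"\s{2,}", value):
            print("consecutive whitespace:", value)
        if "\u2019" in value:
            print("smart quote:", value)
```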
@@ -271,15 +271,15 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
Udana from IWMI asked about the OAI base URL for their community on CGSpace
I think it should be this: https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814
The style sheet obfuscates the data, but if you look at the source it is all there, including information about pagination of results
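The pagination works with standard OAI-PMH resumption tokens; here is a small Python sketch of following them with requests and ElementTree, purely as an illustration of the protocol rather than anything CGSpace-specific:

```
# Sketch: page through an OAI-PMH ListRecords response by following
# resumptionToken elements (standard OAI-PMH behaviour; illustration only).
import requests
import xml.etree.ElementTree as ET

OAI = "https://cgspace.cgiar.org/oai/request"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "com_10568_16814"}
while True:
    root = ET.fromstring(requests.get(OAI, params=params).content)
    records = root.findall(".//oai:record", NS)
    print(f"got {len(records)} records")
    token = root.find(".//oai:resumptionToken", NS)
    if token is None or not (token.text or "").strip():
        break
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```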
-Regarding Udana's Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I'd check them after that
-The latest batch of IITA's 200 records (based on Abenet's version Mercy1805_AY.xls) are now in the IITA_Jan_9_II_Ab collection
+Regarding Udana’s Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I’d check them after that
+The latest batch of IITA’s 200 records (based on Abenet’s version Mercy1805_AY.xls) are now in the IITA_Jan_9_II_Ab collection
So here are some corrections:
- use of Unicode smart quote (hex 2019) in countries and affiliations, for example “COTE D’IVOIRE” and “Institut d’Economic Rurale, Mali”
- inconsistencies in cg.contributor.affiliation:
- “Centro Internacional de Agricultura Tropical” and “Centro International de Agricultura Tropical” should use the English name of CIAT (International Center for Tropical Agriculture)
-- “Institut International d'Agriculture Tropicale” should use the English name of IITA (International Institute of Tropical Agriculture)
+- “Institut International d’Agriculture Tropicale” should use the English name of IITA (International Institute of Tropical Agriculture)
- “East and Southern Africa Regional Center” and “Eastern and Southern Africa Regional Centre”
- “Institut de la Recherche Agronomique, Cameroon” and “Institut de Recherche Agronomique, Cameroon”
- “Institut des Recherches Agricoles du Bénin” and “Institut National des Recherche Agricoles du Benin” and “National Agricultural Research Institute, Benin”
@@ -320,7 +320,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
- “MATÉRIEL DE PLANTATION” and “MATÉRIELS DE PLANTATION”
-- I noticed that some records do have encoding errors in the dc.description.abstract field, but only four of them so probably not from Abenet's handling of the XLS file
+- I noticed that some records do have encoding errors in the dc.description.abstract field, but only four of them so probably not from Abenet’s handling of the XLS file
- Based on manually eyeballing the text I used a custom text facet with this GREL to identify the records:
@@ -344,7 +344,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
2018-06-13
-- Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara's items
+- Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara’s items
- I used my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
@@ -355,7 +355,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
"Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
"Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
-- On a hunch I checked to see if CGSpace's bitstream cleanup was working properly and of course it's broken:
+- On a hunch I checked to see if CGSpace’s bitstream cleanup was working properly and of course it’s broken:
$ dspace cleanup -v
...
@@ -368,7 +368,7 @@ Error: ERROR: update or delete on table "bitstream" violates foreign k
UPDATE 1
2018-06-14
-- Check through Udana's IWMI records from last week on DSpace Test
+- Check through Udana’s IWMI records from last week on DSpace Test
- There were only some minor whitespace and one or two syntax errors, but they look very good otherwise
- I uploaded the twenty-four reports to the IWMI Reports collection: https://cgspace.cgiar.org/handle/10568/36188
- I uploaded the seventy-six book chapters to the IWMI Book Chapters collection: https://cgspace.cgiar.org/handle/10568/36178
@@ -384,22 +384,22 @@ $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h loca
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
- The -O option to pg_restore makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore
-- I always prefer to use the
postgres
user locally because it's just easier than remembering the dspacetest
user's password, but then I couldn't figure out why the resulting schema was owned by postgres
+- I always prefer to use the postgres user locally because it’s just easier than remembering the dspacetest user’s password, but then I couldn’t figure out why the resulting schema was owned by postgres
- So with this you connect as the postgres superuser and then switch roles to dspacetest (also, make sure this user has superuser privileges before the restore)
- Last week Linode emailed me to say that our Linode 8192 instance used for DSpace Test qualified for an upgrade
- Apparently they announced some upgrades to most of their plans in 2018-05
-- After the upgrade I see we have more disk space available in the instance's dashboard, so I shut the instance down and resized it from 98GB to 160GB
+- After the upgrade I see we have more disk space available in the instance’s dashboard, so I shut the instance down and resized it from 98GB to 160GB
- The resize was very quick (less than one minute) and after booting the instance back up I now have 160GB for the root filesystem!
-- I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage, as it was actually kinda slow when we put Solr there and now we don't actually need it anymore because running the production Solr on this instance didn't work well with 8GB of RAM
-- Also, the larger instance we're using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB… that means we don't need to consider using block storage right now!
-- The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don't need to bother with upgrading them
+- I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage, as it was actually kinda slow when we put Solr there and now we don’t actually need it anymore because running the production Solr on this instance didn’t work well with 8GB of RAM
+- Also, the larger instance we’re using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB… that means we don’t need to consider using block storage right now!
+- The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don’t need to bother with upgrading them
- Last week Abenet asked if we could add dc.language.iso to the advanced search filters
-- There is already a search filter for this field defined in discovery.xml but we aren't using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)
+- There is already a search filter for this field defined in discovery.xml but we aren’t using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)
- Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:
Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
-- It took me a while to figure out that this migration is for MQM, which I removed after Atmire's original advice about the migrations so we actually need to delete this migration instead up updating it
+- It took me a while to figure out that this migration is for MQM, which I removed after Atmire’s original advice about the migrations, so we actually need to delete this migration instead of updating it
- So I need to make sure to run the following during the DSpace 5.8 upgrade:
-- Delete existing CUA 4 migration if it exists
@@ -430,20 +430,20 @@ Done.
"Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
2018-06-26
-- Atmire got back to me to say that we can remove the itemCollectionPlugin and HasBitstreamsSSIPlugin beans from DSpace's discovery.xml file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore
+- Atmire got back to me to say that we can remove the itemCollectionPlugin and HasBitstreamsSSIPlugin beans from DSpace’s discovery.xml file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore
- I removed both those beans and did some simple tests to check item submission, media-filter of PDFs, REST API, but got an error “No matches for the query” when listing records in OAI
- This warning appears in the DSpace log:
2018-06-26 16:58:12,052 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
-- It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
+- It’s actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
- Ah, I think I just need to run
dspace oai import
2018-06-27
- Vika from CIFOR sent back his annotations on the duplicates for the “CIFOR_May_9” archive import that I sent him last week
-- I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
-- First, get the 62 deletes from Vika's file and remove them from the collection:
+- I’ll have to figure out how to separate those we’re keeping, deleting, and mapping into CIFOR’s archive collection
+- First, get the 62 deletes from Vika’s file and remove them from the collection:
$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
@@ -470,7 +470,7 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
Then I can use Open Refine to add the “CIFOR Archive” collection to the mappings
Importing the 2398 items via dspace metadata-import ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000 (see the sketch below)
After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch
-I'll let Abenet take one last look and then move them to CGSpace
+I’ll let Abenet take one last look and then move them to CGSpace
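A rough sketch of how that batching could look, assuming the combined metadata CSV is called cifor-metadata.csv (a placeholder name, as is the e-person address) and that each chunk keeps the header row so dspace metadata-import still accepts it:

```
# Hypothetical sketch: split a large metadata CSV into ~1,000-row chunks,
# keeping the header row in each chunk, then import them one at a time
$ head -n 1 cifor-metadata.csv > /tmp/header.csv
$ tail -n +2 cifor-metadata.csv | split -l 1000 - /tmp/chunk-
$ for chunk in /tmp/chunk-*; do cat /tmp/header.csv "$chunk" > "${chunk}.csv"; done
$ for csv in /tmp/chunk-*.csv; do dspace metadata-import -f "$csv" -e user@example.org; done
```

Each row in the export CSV is one item (multiple values share a cell), so a 1,000-line chunk is roughly a 1,000-item batch.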
2018-06-28
@@ -481,9 +481,9 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
[Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
[Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
-- Look over IITA's IITA_Jan_9_II_Ab collection from earlier this month on DSpace Test
+- Look over IITA’s IITA_Jan_9_II_Ab collection from earlier this month on DSpace Test
- Bosede fixed a few things (and seems to have removed many French IITA subjects like AMÉLIORATION DES PLANTES and SANTÉ DES PLANTES)
-- I still see at least one issue with author affiliations, and I didn't bother to check the AGROVOC subjects because it's such a mess aanyways
+- I still see at least one issue with author affiliations, and I didn’t bother to check the AGROVOC subjects because it’s such a mess anyways
- I suggested that IITA provide an updated list of subjects to us so we can include their controlled vocabulary in CGSpace, which would also make it easier to do automated validation
diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html
index 89bd5ecd8..5dda54087 100644
--- a/docs/2018-07/index.html
+++ b/docs/2018-07/index.html
@@ -33,7 +33,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
-
+
@@ -63,7 +63,7 @@ There is insufficient memory for the Java Runtime Environment to continue.
-
+
@@ -110,7 +110,7 @@ There is insufficient memory for the Java Runtime Environment to continue.
July, 2018
@@ -217,7 +217,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
2018-07-04
- I verified that the autowire error indeed only occurs on Tomcat 8.5, but the application works fine on Tomcat 7
-- I have raised this in the DSpace 5.8 compatibility ticket on Atmire's tracker
+- I have raised this in the DSpace 5.8 compatibility ticket on Atmire’s tracker
- Abenet wants me to add “United Kingdom government” to the sponsors on CGSpace so I created a ticket to track it (#381)
- Also, Udana wants me to add “Enhancing Sustainability Across Agricultural Systems” to the WLE Phase II research themes so I created a ticket to track that (#382)
- I need to try to finish this DSpace 5.8 business first because I have too many branches with cherry-picks going on right now!
@@ -225,13 +225,13 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
2018-07-06
- CCAFS want me to add “PII-FP2_MSCCCAFS” to their Phase II project tags on CGSpace (#383)
-- I'll do it in a batch with all the other metadata updates next week
+- I’ll do it in a batch with all the other metadata updates next week
2018-07-08
-- I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn't being backed up to S3
-- I apparently noticed this—and fixed it!—in 2016-07, but it doesn't look like the backup has been updated since then!
-- It looks like I added Solr to the
backup_to_s3.sh
script, but that script is not even being used (s3cmd
is run directly from root's crontab)
+- I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn’t being backed up to S3
+- I apparently noticed this—and fixed it!—in 2016-07, but it doesn’t look like the backup has been updated since then!
+- It looks like I added Solr to the backup_to_s3.sh script, but that script is not even being used (s3cmd is run directly from root’s crontab)
- For now I have just initiated a manual S3 backup of the Solr data:
# s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
@@ -245,16 +245,16 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
-- But after comparing to the existing list of names I didn't see much change, so I just ignored it
+- But after comparing to the existing list of names I didn’t see much change, so I just ignored it
2018-07-09
-- Uptime Robot said that CGSpace was down for two minutes early this morning but I don't see anything in Tomcat logs or dmesg
-- Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat's
catalina.out
:
+- Uptime Robot said that CGSpace was down for two minutes early this morning but I don’t see anything in Tomcat logs or dmesg
+- Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s catalina.out:
Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
-- I'm not sure if it's the same error, but I see this in DSpace's
solr.log
:
+- I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:
2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
@@ -284,17 +284,17 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
4435
-95.108.181.88
appears to be Yandex, so I dunno why it's creating so many sessions, as its user agent should match Tomcat's Crawler Session Manager Valve
-70.32.83.92
is on MediaTemple but I'm not sure who it is. They are mostly hitting REST so I guess that's fine
-35.227.26.162
doesn't declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx
+95.108.181.88 appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match Tomcat’s Crawler Session Manager Valve
+70.32.83.92 is on MediaTemple but I’m not sure who it is. They are mostly hitting REST so I guess that’s fine
+35.227.26.162 doesn’t declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx
178.154.200.38 is Yandex again
207.46.13.47 is Bing
157.55.39.234 is Bing
137.108.70.6 is our old friend CORE bot
-50.116.102.77 doesn't declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that's fine
+50.116.102.77 doesn’t declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that’s fine
40.77.167.84 is Bing again
- Interestingly, the first time that I see 35.227.26.162 was on 2018-06-08
-- I've added 35.227.26.162 to the bot tagging logic in the nginx vhost
+- I’ve added 35.227.26.162 to the bot tagging logic in the nginx vhost
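For reference, a minimal sketch of what that bot tagging could look like in the nginx vhost (an assumption about the config, not the actual template), using a map on the client address that the rate-limiting or user-agent logic can then check:

```
# Hypothetical sketch: tag requests from known bot IPs so the rest of the
# vhost can rate-limit them or treat them as declared bots
map $remote_addr $is_bot_ip {
    default        0;
    35.227.26.162  1;
}
```

An nginx geo block would also work here and supports CIDR ranges, which is handy for whole cloud networks.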
2018-07-10
@@ -303,7 +303,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
- Add “PII-FP2_MSCCCAFS” to CCAFS Phase II Project Tags (#383)
- Add journal title (dc.source) to Discovery search filters (#384)
- All were tested and merged to the
5_x-prod
branch and will be deployed on CGSpace this coming weekend when I do the Linode server upgrade
-- I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire's 5.8 pull request (#378)
+- I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire’s 5.8 pull request (#378)
- Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC
- These are the top ten users in the last two hours:
@@ -324,7 +324,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
- He said there was a bug that caused his app to request a bunch of invalid URLs
-- I'll have to keep and eye on this and see how their platform evolves
+- I’ll have to keep an eye on this and see how their platform evolves
2018-07-11
@@ -365,9 +365,9 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
96 40.77.167.90
7075 208.110.72.10
-- We have never seen
208.110.72.10
before… so that's interesting!
+- We have never seen 208.110.72.10 before… so that’s interesting!
- The user agent for these requests is: Pcore-HTTP/v0.44.0
-- A brief Google search doesn't turn up any information about what this bot is, but lots of users complaining about it
+- A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it
- This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
@@ -387,7 +387,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
- So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting
-- I'll also add it to Tomcat's Crawler Session Manager Valve to force the re-use of a common Tomcat sesssion for all crawlers just in case
+- I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case
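The Tomcat side is just a matter of extending the crawlerUserAgents regex on the valve in server.xml; a sketch (the exact pattern is an assumption):

```
<!-- Hypothetical sketch: make known crawlers, including Pcore-HTTP, share a
     single session instead of creating a new one on every request -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Pcore-HTTP.*" />
```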
- Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
@@ -406,7 +406,7 @@ COPY 4518
2018-07-15
- Run all system updates on CGSpace, add latest metadata changes from last week, and start the Linode instance upgrade
-- After the upgrade I see we have more disk space available in the instance's dashboard, so I shut the instance down and resized it from 392GB to 650GB
+- After the upgrade I see we have more disk space available in the instance’s dashboard, so I shut the instance down and resized it from 392GB to 650GB
- The resize was very quick (less than one minute) and after booting the instance back up I now have 631GB for the root filesystem (with 267GB available)!
- Peter had asked a question about how mapped items are displayed in the Altmetric dashboard
- For example, 10568/82810 is mapped to four collections, but only shows up in one “department” in their dashboard
@@ -452,9 +452,9 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
- ICARDA sent me another refined list of ORCID iDs so I sorted and formatted them into our controlled vocabulary again
- Participate in call with IWMI and WLE to discuss Altmetric, CGSpace, and social media
-- I told them that they should try to be including the Handle link on their social media shares because that's the only way to get Altmetric to notice them and associate them with their DOIs
+- I told them that they should try to include the Handle link on their social media shares because that’s the only way to get Altmetric to notice them and associate them with their DOIs
- I suggested that we should have a wider meeting about this, and that I would post that on Yammer
-- I was curious about how and when Altmetric harvests the OAI, so I looked in nginx's OAI log
+- I was curious about how and when Altmetric harvests the OAI, so I looked in nginx’s OAI log
- For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1500 requests
- In there I see two bots making about 750 requests each, and this one is probably Altmetric:
@@ -494,7 +494,7 @@ X-XSS-Protection: 1; mode=block
- Post a note on Yammer about Altmetric and Handle best practices
- Update PostgreSQL JDBC jar from 42.2.2 to 42.2.4 in the RMG Ansible playbooks
- IWMI asked why all the dates in their OpenSearch RSS feed show up as January 01, 2018
-- On closer inspection I notice that many of their items use “2018” as their
dc.date.issued
, which is a valid ISO 8601 date but it's not very specific so DSpace assumes it is January 01, 2018 00:00:00…
+- On closer inspection I notice that many of their items use “2018” as their dc.date.issued, which is a valid ISO 8601 date but it’s not very specific so DSpace assumes it is January 01, 2018 00:00:00…
- I told her that they need to start using more accurate dates for their issue dates
- In the example item I looked at the DOI has a publish date of 2018-03-16, so they should really try to capture that
@@ -507,8 +507,8 @@ X-XSS-Protection: 1; mode=block
webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
- Just because I was curious I made sure that these options are working as expected in DSpace 5.8 on DSpace Test (they are)
-- I tested the Atmire Listings and Reports (L&R) module one last time on my local test environment with a new snapshot of CGSpace's database and re-generated Discovery index and it worked fine
-- I finally informed Atmire that we're ready to proceed with deploying this to CGSpace and that they should advise whether we should wait about the SNAPSHOT versions in
pom.xml
+- I tested the Atmire Listings and Reports (L&R) module one last time on my local test environment with a new snapshot of CGSpace’s database and re-generated Discovery index and it worked fine
+- I finally informed Atmire that we’re ready to proceed with deploying this to CGSpace and that they should advise whether we should wait about the SNAPSHOT versions in pom.xml
- There is no word on the issue I reported with Tomcat 8.5.32 yet, though…
2018-07-23
@@ -539,7 +539,7 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
2018-07-27
-- Follow up with Atmire again about the SNAPSHOT versions in our
pom.xml
because I want to finalize the DSpace 5.8 upgrade soon and I haven't heard from them in a month (ticket 560)
+- Follow up with Atmire again about the SNAPSHOT versions in our pom.xml because I want to finalize the DSpace 5.8 upgrade soon and I haven’t heard from them in a month (ticket 560)
diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html
index 08cec5d29..4d3f9cc92 100644
--- a/docs/2018-08/index.html
+++ b/docs/2018-08/index.html
@@ -15,10 +15,10 @@ DSpace Test had crashed at some point yesterday morning and I see the following
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
-I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
+I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
" />
-
+
@@ -73,7 +73,7 @@ I ran all system updates on DSpace Test and rebooted it
-
+
@@ -120,7 +120,7 @@ I ran all system updates on DSpace Test and rebooted it
August, 2018
@@ -134,10 +134,10 @@ I ran all system updates on DSpace Test and rebooted it
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat's
-- I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+- From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
+- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
@@ -152,23 +152,23 @@ I ran all system updates on DSpace Test and rebooted it
2018-08-02
-- DSpace Test crashed again and I don't see the only error I see is this in
dmesg
:
+- DSpace Test crashed again and the only error I see is this in dmesg:
[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
- I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
-- The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
+- The risk we run there is that we’ll start getting OutOfMemory errors from Tomcat
- So basically we need a new test server with more RAM very soon…
- Abenet asked about the workflow statistics in the Atmire CUA module again
-- Last year Atmire told me that it's disabled by default but you can enable it with
workflow.stats.enabled = true
in the CUA configuration file
-- There was a bug with adding users so they sent a patch, but I didn't merge it because it was very dirty and I wasn't sure it actually fixed the problem
-- I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”
+- Last year Atmire told me that it’s disabled by default but you can enable it with workflow.stats.enabled = true in the CUA configuration file
+- There was a bug with adding users so they sent a patch, but I didn’t merge it because it was very dirty and I wasn’t sure it actually fixed the problem
+- I just tried to enable the stats again on DSpace Test now that we’re on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”
- As a test I submitted a new item and I was able to see it in the workflow statistics “data” tab, but not in the graph
2018-08-15
-- Run through Peter's list of author affiliations from earlier this month
+- Run through Peter’s list of author affiliations from earlier this month
- I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
- Finally I did a test run with the fix-metadata-value.py script:
@@ -210,8 +210,8 @@ Verchot, L.V.
Verchot, LV
Verchot, Louis V.
-- I'll just tag them all with Louis Verchot's ORCID identifier…
-- In the end, I'll run the following CSV with my add-orcid-identifiers-csv.py script:
+- I’ll just tag them all with Louis Verchot’s ORCID identifier…
+- In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:
dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
@@ -290,17 +290,17 @@ sys 2m20.248s
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
-- I don't even know how its possible for the bot to use MORE sessions than total requests…
+- I don’t even know how it’s possible for the bot to use MORE sessions than total requests…
- The user agent is:
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
-- So I'm thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.
+- So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.
2018-08-20
- Help Sisay with some UTF-8 encoding issues in a file Peter sent him
-- Finish up reconciling Atmire's pull request for DSpace 5.8 changes with the latest status of our
5_x-prod
branch
+- Finish up reconciling Atmire’s pull request for DSpace 5.8 changes with the latest status of our 5_x-prod branch
- I had to do some git rev-list --reverse --no-merges oldestcommit..newestcommit and git cherry-pick -S hackery to get everything all in order
- After building I ran the Atmire schema migrations and forced old migrations, then did the ant update
- I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set JAVA_OPTS to 1024m and tried the mvn package again
@@ -308,8 +308,8 @@ sys 2m20.248s
- I will try to reduce Tomcat memory from 4608m to 4096m and then retry the mvn package with 1024m of JAVA_OPTS again
- After running the mvn package for the third time and waiting an hour, I attached strace to the Java process and saw that it was indeed reading XMLUI theme data… so I guess I just need to wait more
- After waiting two hours the maven process completed and installation was successful
-- I restarted Tomcat and it seems everything is working well, so I'll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
-- I merged Atmire's pull request into our
5_x-dspace-5.8
temporary brach and then cherry-picked all the changes from 5_x-prod
since April, 2018 when that temporary branch was created
+- I restarted Tomcat and it seems everything is working well, so I’ll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
+- I merged Atmire’s pull request into our 5_x-dspace-5.8 temporary branch and then cherry-picked all the changes from 5_x-prod since April, 2018 when that temporary branch was created
- As the branch histories are very different I cannot merge the new 5.8 branch into the current 5_x-prod branch
- Instead, I will archive the current 5_x-prod DSpace 5.5 branch as 5_x-prod-dspace-5.5 and then hard reset 5_x-prod based on 5_x-dspace-5.8
- Unfortunately this will mess up the references in pull requests and issues on GitHub
@@ -320,8 +320,8 @@ sys 2m20.248s
[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
-- It's the same on DSpace Test, my local laptop, and CGSpace…
-- It wasn't this way before when I was constantly building the previous 5.8 branch with Atmire patches…
+- It’s the same on DSpace Test, my local laptop, and CGSpace…
+- It wasn’t this way before when I was constantly building the previous 5.8 branch with Atmire patches…
- I will restore the previous 5_x-dspace-5.8 and atmire-module-upgrades-5.8 branches to see if the build time is different there
- … it seems that the atmire-module-upgrades-5.8 branch still takes 1 hour and 23 minutes on my local machine…
- Let me try to build the old 5_x-prod-dspace-5.5 branch on my local machine and see how long it takes
@@ -330,7 +330,7 @@ sys 2m20.248s
[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
-- And I notice that Atmire changed something in the XMLUI module's
pom.xml
as part of the DSpace 5.8 changes, specifically to remove the exclude for node_modules
in the maven-war-plugin
step
+- And I notice that Atmire changed something in the XMLUI module’s pom.xml as part of the DSpace 5.8 changes, specifically to remove the exclude for node_modules in the maven-war-plugin step
- This exclude is present in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
- It makes sense that it would take longer to complete this step because the node_modules folder has tens of thousands of files, and we have 27 themes!
- I need to test to see if this has any side effects when deployed…
@@ -342,14 +342,14 @@ sys 2m20.248s
- They say they want to start working on the ContentDM harvester middleware again
- I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace
- Discuss CTA items with Sisay, he was trying to figure out how to do the collection mapping in combination with SAFBuilder
-- It appears that the web UI's upload interface requires you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the
collections
file inside each item in the bundle
+- It appears that the web UI’s upload interface requires you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the collections file inside each item in the bundle
- I imported the CTA items on CGSpace for Sisay:
$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
2018-08-26
- Doing the DSpace 5.8 upgrade on CGSpace (linode18)
-- I already finished the Maven build, now I'll take a backup of the PostgreSQL database and do a database cleanup just in case:
+- I already finished the Maven build, now I’ll take a backup of the PostgreSQL database and do a database cleanup just in case:
$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
$ dspace cleanup -v
@@ -371,7 +371,7 @@ dspace=> \q
$ dspace database migrate ignored
-- Then I'll run all system updates and reboot the server:
+- Then I’ll run all system updates and reboot the server:
$ sudo su -
# apt update && apt full-upgrade
@@ -380,9 +380,9 @@ $ dspace database migrate ignored
- After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine
- Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now
-- They want a CSV with all metadata, which the Atmire Listings and Reports module can't do
+- They want a CSV with all metadata, which the Atmire Listings and Reports module can’t do
- I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject GENDER or GENDER POVERTY AND INSTITUTIONS, and CRP Water, Land and Ecosystems
-- Then I extracted the Handle links from the report so I could export each item's metadata as CSV
+- Then I extracted the Handle links from the report so I could export each item’s metadata as CSV
$ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
@@ -391,21 +391,21 @@ $ dspace database migrate ignored
$ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
- But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
-- I'm not sure how to proceed without writing some script to parse and join the CSVs, and I don't think it's worth my time
-- I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I'm not sure why I got those errors last time I tried
+- I’m not sure how to proceed without writing some script to parse and join the CSVs, and I don’t think it’s worth my time
+- I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I’m not sure why I got those errors last time I tried
- It could have been a configuration issue, though, as I also reconciled the server.xml with the one in our Ansible infrastructure scripts
- But now I can start testing and preparing to move DSpace Test to Ubuntu 18.04 + Tomcat 8.5 + OpenJDK + PostgreSQL 9.6…
- Actually, upon closer inspection, it seems that when you try to go to Listings and Reports under Tomcat 8.5.33 you are taken to the JSPUI login page despite having already logged in in XMLUI
- If I type my username and password again it does take me to Listings and Reports, though…
-- I don't see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire
-- For what it's worth, the Content and Usage (CUA) module does load, though I can't seem to get any results in the graph
+- I don’t see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire
+- For what it’s worth, the Content and Usage (CUA) module does load, though I can’t seem to get any results in the graph
- I just checked to see if the Listings and Reports issue with using the CGSpace citation field was fixed as planned alongside the DSpace 5.8 upgrades (#589)
- I was able to create a new layout containing only the citation field, so I closed the ticket
2018-08-29
- Discuss COPO with Martin Mueller
-- He and the consortium's idea is to use this for metadata annotation (submission?) to all repositories
+- He and the consortium’s idea is to use this for metadata annotation (submission?) to all repositories
- It is somehow related to adding events as items in the repository, and then linking related papers, presentations, etc to the event item using dc.relation, etc.
- Discuss Linode server charges with Abenet, apparently we want to start charging these to Big Data
diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
index b34a047e3..6c148a8bb 100644
--- a/docs/2018-09/index.html
+++ b/docs/2018-09/index.html
@@ -9,9 +9,9 @@
@@ -23,11 +23,11 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
-
+
@@ -57,7 +57,7 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
-
+
@@ -104,7 +104,7 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
September, 2018
@@ -112,9 +112,9 @@ I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
2018-09-02
- New PostgreSQL JDBC driver version 42.2.5
-- I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
-- Also, I'll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
-- I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
+- I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
+- Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
+- I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
@@ -138,11 +138,11 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
XMLUI fails to load, but the REST, SOLR, JSPUI, etc work
The old 5_x-prod-dspace-5.5 branch does work in Ubuntu 18.04 with Tomcat 8.5.30-1ubuntu1.4, however!
And the 5_x-prod DSpace 5.8 branch does work in Tomcat 8.5.x on my Arch Linux laptop…
-I'm not sure where the issue is then!
+I’m not sure where the issue is then!
2018-09-03
-- Abenet says she's getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week
+- Abenet says she’s getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week
- They are from the CUA module
- Two of them have “no data” and one has a “null” title
- The last one is a report of the top downloaded items, and includes a graph
@@ -151,13 +151,13 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
2018-09-04
-- I'm looking over the latest round of IITA records from Sisay: Mercy1806_August_29
+- I’m looking over the latest round of IITA records from Sisay: Mercy1806_August_29
- All fields are split with multiple columns like cg.authorship.types and cg.authorship.types[]
- This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)
- Five items had dc.date.issued values like 2013-5 so I corrected them to be 2013-05
- Several metadata fields had values with newlines in them (even in some titles!), which I fixed by trimming the consecutive whitespaces in Open Refine
-- Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn't exist then so this is impossible
+- Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn’t exist then so this is impossible
- I got all items that were from 2011 and onwards using a custom facet with this GREL on the dc.date.issued column: isNotNull(value.match(/201[1-8].*/)) and then blanking their CRPs
@@ -170,7 +170,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
- One invalid value for dc.type
-- Abenet says she hasn't received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don't need create an issue on Atmire's bug tracker anymore
+- Abenet says she hasn’t received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don’t need to create an issue on Atmire’s bug tracker anymore
2018-09-10
@@ -213,7 +213,7 @@ requests:
2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
-- Seems to be during submit step, because it's workflow step 1…?
+- Seems to be during submit step, because it’s workflow step 1…?
- Move some top-level CRP communities to be below the new CGIAR Research Programs and Platforms community:
$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
@@ -237,7 +237,7 @@ UPDATE 15
The current cg.identifier.status field will become “Access rights” and dc.rights will become “Usage rights”
I have some work in progress on the 5_x-rights branch
Linode said that CGSpace (linode18) had a high CPU load earlier today
-When I looked, I see it's the same Russian IP that I noticed last month:
+When I looked, I see it’s the same Russian IP that I noticed last month:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
@@ -260,7 +260,7 @@ UPDATE 15
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
-- I added
.*crawl.*
to the Tomcat Session Crawler Manager Valve, so I'm not sure why the bot is creating so many sessions…
+- I added .*crawl.* to the Tomcat Crawler Session Manager Valve, so I’m not sure why the bot is creating so many sessions…
- I just tested that user agent on CGSpace and it does not create a new session:
$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
@@ -298,17 +298,17 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
- Sisay is still having problems with the controlled vocabulary for top authors
- I took a look at the submission template and Firefox complains that the XML file is missing a root element
-- I guess it's because Firefox is receiving an empty XML file
+- I guess it’s because Firefox is receiving an empty XML file
- I told Sisay to run the XML file through tidy
- More testing of the access and usage rights changes
2018-09-13
- Peter was communicating with Altmetric about the OAI mapping issue for item 10568/82810 again
-- Altmetric said it was somehow related to the OAI
dateStamp
not getting updated when the mappings changed, but I said that back in 2018-07 when this happened it was because the OAI was actually just not reflecting all the item's mappings
+- Altmetric said it was somehow related to the OAI dateStamp not getting updated when the mappings changed, but I said that back in 2018-07 when this happened it was because the OAI was actually just not reflecting all the item’s mappings
- After forcing a complete re-indexing of OAI the mappings were fine
-- The
dateStamp
is most probably only updated when the item's metadata changes, not its mappings, so if Altmetric is relying on that we're in a tricky spot
-- We need to make sure that our OAI isn't publicizing stale data… I was going to post something on the dspace-tech mailing list, but never did
+- The dateStamp is most probably only updated when the item’s metadata changes, not its mappings, so if Altmetric is relying on that we’re in a tricky spot
+- We need to make sure that our OAI isn’t publicizing stale data… I was going to post something on the dspace-tech mailing list, but never did
- Linode says that CGSpace (linode18) has had high CPU for the past two hours
- The top IP addresses today are:
@@ -331,8 +331,8 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2
-- So I'm not sure what's going on
-- Valerio asked me if there's a way to get the page views and downloads from CGSpace
+- So I’m not sure what’s going on
+- Valerio asked me if there’s a way to get the page views and downloads from CGSpace
- I said no, but that we might be able to piggyback on the Atmire statlet REST API
- For example, when you expand the “statlet” at the bottom of an item like 10568/97103 you can see the following request in the browser console:
@@ -340,12 +340,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
- That JSON file has the total page views and item downloads for the item…
- Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds
-- I had a quick look at the DSpace 5.x manual and it doesn't not seem that this is possible (you can only add metadata)
-- Testing the new LDAP server the CGNET says will be replacing the old one, it doesn't seem that they are using the global catalog on port 3269 anymore, now only 636 is open
+- I had a quick look at the DSpace 5.x manual and it doesn’t seem that this is possible (you can only add metadata)
+- Testing the new LDAP server that CGNET says will be replacing the old one, it doesn’t seem that they are using the global catalog on port 3269 anymore, now only 636 is open
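A quick way to sanity check that from the command line (a sketch, with ldap.example.org standing in for the real hostname):

```
# Hypothetical sketch: confirm that LDAPS on 636 answers and that the old
# global catalog port 3269 no longer does
$ openssl s_client -connect ldap.example.org:636 </dev/null 2>/dev/null | grep -E '^(subject|issuer)'
$ nc -z -v -w 5 ldap.example.org 3269
```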
- I did a clean deploy of DSpace 5.8 on Ubuntu 18.04 with some stripped down Tomcat 8 configuration and actually managed to get it up and running without the autowire errors that I had previously experienced
- I realized that it always works on my local machine with Tomcat 8.5.x, but not when I do the deployment from Ansible in Ubuntu 18.04
- So there must be something in my Tomcat 8 server.xml template
-- Now I re-deployed it with the normal server template and it's working, WTF?
+- Now I re-deployed it with the normal server template and it’s working, WTF?
- Must have been something like an old DSpace 5.5 file in the spring folder… weird
- But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc…
@@ -357,7 +357,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
2018-09-16
- Add the DSpace build.properties as a template into my Ansible infrastructure scripts for configuring DSpace machines
-- One stupid thing there is that I add all the variables in a private vars file, which is apparently higher precedence than host vars, meaning that I can't override them (like SMTP server) on a per-host basis
+- One stupid thing there is that I add all the variables in a private vars file, which is apparently higher precedence than host vars, meaning that I can’t override them (like SMTP server) on a per-host basis
- Discuss access and usage rights with Peter
- I suggested that we leave access rights (cg.identifier.access) as it is now, with “Open Access” or “Limited Access”, and then simply re-brand that as “Access rights” in the UIs and relevant drop downs
- Then we continue as planned to add dc.rights as “Usage rights”
@@ -374,26 +374,26 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
- Update these immediately, but talk to CodeObia to create a mapping between the old and new values
- Finalize dc.rights “Usage rights” with seven combinations of Creative Commons, plus the others
-- Need to double check the new CRP community to see why the collection counts aren't updated after we moved the communities there last week
+- Need to double check the new CRP community to see why the collection counts aren’t updated after we moved the communities there last week
- I forced a full Discovery re-index and now the community shows 1,600 items
-- Check if it's possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL
-- Agree that we'll publicize AReS explorer on the week before the Big Data Platform workshop
+- Check if it’s possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL
+- Agree that we’ll publicize AReS explorer on the week before the Big Data Platform workshop
- Put a link and or picture on the CGSpace homepage saying “Visualized CGSpace research” or something, and post a message on Yammer
- I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer
-- Currently CodeObia is exploring using the Atmire statlets internal API, but I don't really like that…
+- Currently CodeObia is exploring using the Atmire statlets internal API, but I don’t really like that…
- There are some example queries on the DSpace Solr wiki
- For example, this query returns 1655 rows for item 10568/10630:
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
-- The id in the Solr query is the item's database id (get it from the REST API or something)
-- Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire's statlet shows, though the query logic here is confusing:
+- The id in the Solr query is the item’s database id (get it from the REST API or something)
+- Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire’s statlet shows, though the query logic here is confusing:
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
@@ -404,7 +404,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
-(bundleName:[*+TO+*]-bundleName:ORIGINAL) seems to be a negative query starting with all documents, subtracting those with bundleName:ORIGINAL, and then negating the whole thing… meaning only documents from bundleName:ORIGINAL?
-What the shit, I think I'm right: the simplified logic in this query returns the same 889:
+What the shit, I think I’m right: the simplified logic in this query returns the same 889:
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
@@ -412,12 +412,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
-- As for item views, I suppose that's just the same query, minus the
bundleName:ORIGINAL
:
+- As for item views, I suppose that’s just the same query, minus the bundleName:ORIGINAL:
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
- That one returns 766, which is exactly 1655 minus 889…
-- Also, Solr's
fq
is similar to the regular q
query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries
+- Also, Solr’s fq is similar to the regular q query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries
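Putting those two queries together, a small sketch of how a script could pull both numbers for one item (same filters as the httpie queries above, with wt=json added so jq can read the response; item 11576 is just the example from before):

```
# Hypothetical sketch: fetch download and view counts for one item from the
# Solr statistics core and print only numFound
$ SOLR='http://localhost:3000/solr/statistics/select?wt=json&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=statistics_type:view'
$ curl -s "${SOLR}&fq=bundleName:ORIGINAL" | jq '.response.numFound'   # downloads
$ curl -s "${SOLR}&fq=-bundleName:ORIGINAL" | jq '.response.numFound'  # views
```

These are the same filters as above, just wrapped so the counts are easy to loop over a list of item IDs.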
2018-09-18
@@ -432,7 +432,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
"views": 15
}
-- The numbers are different than those that come from Atmire's statlets for some reason, but as I'm querying Solr directly, I have no idea where their numbers come from!
+- The numbers are different than those that come from Atmire’s statlets for some reason, but as I’m querying Solr directly, I have no idea where their numbers come from!
- Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&page=1
- Getting all the item IDs from PostgreSQL is certainly easy:
@@ -443,7 +443,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
2018-09-19
- I emailed Jane Poole to ask if there is some money we can use from the Big Data Platform (BDP) to fund the purchase of some Atmire credits for CGSpace
-- I learned that there is an efficient way to do “deep paging” in large Solr results sets by using
cursorMark
, but it doesn't work with faceting
+- I learned that there is an efficient way to do “deep paging” in large Solr result sets by using cursorMark, but it doesn’t work with faceting
2018-09-20
@@ -464,21 +464,21 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
2018-09-21
- I see that there was a nice optimization to the ImageMagick PDF CMYK detection in the upstream dspace-5_x branch: DS-3664
-- The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I'll cherry-pick that fix into our
5_x-prod
branch:
+ - The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I’ll cherry-pick that fix into our 5_x-prod branch:
- 4e8c7b578bdbe26ead07e36055de6896bbf02f83: ImageMagick: Only execute “identify” on first page
- I think it would also be nice to cherry-pick the fixes for DS-3883, which is related to optimizing the XMLUI item display of items with many bitstreams
-- a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don't loop through original bitstreams if only displaying thumbnails
+- a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don’t loop through original bitstreams if only displaying thumbnails
- 8d81e825dee62c2aa9d403a505e4a4d798964e8d: DS-3883: If only including thumbnails, only load the main item thumbnail.
2018-09-23
-- I did more work on my cgspace-statistics-api, fixing some item view counts and adding indexing via SQLite (I'm trying to avoid having to set up yet another database, user, password, etc) during deployment
+- I did more work on my cgspace-statistics-api, fixing some item view counts and adding indexing via SQLite (I’m trying to avoid having to set up yet another database, user, password, etc) during deployment
- I created a new branch called `5_x-upstream-cherry-picks` to test and track those cherry-picks from the upstream 5.x branch
- Also, I need to test the new LDAP server, so I will deploy that on DSpace Test today
- Rename my cgspace-statistics-api to dspace-statistics-api on GitHub
@@ -486,7 +486,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
2018-09-24
- Trying to figure out how to get item views and downloads from SQLite in a join
-- It appears SQLite doesn't support `FULL OUTER JOIN` so some people on StackOverflow have emulated it with `LEFT JOIN` and `UNION`:
+- It appears SQLite doesn’t support `FULL OUTER JOIN` so some people on StackOverflow have emulated it with `LEFT JOIN` and `UNION`:
> SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
LEFT JOIN itemdownloads downloads USING(id)
@@ -495,7 +495,7 @@ SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloa
LEFT JOIN itemviews views USING(id)
WHERE views.id IS NULL;
-- This “works” but the resulting rows are kinda messy so I'd have to do extra logic in Python
+- This “works” but the resulting rows are kinda messy so I’d have to do extra logic in Python
- Maybe we can use one “items” table with default values and UPSERT (aka insert… on conflict … do update):
sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
@@ -507,9 +507,9 @@ sqlite> INSERT INTO items(id, views) VALUES(0, 3) ON CONFLICT(id) DO UPDATE S
sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET downloads=excluded.views;
- This totally works!
-- Note the special `excluded.views` form! See SQLite's lang_UPSERT documentation
-- Oh nice, I finally finished the Falcon API route to page through all the results using SQLite's amazing `LIMIT` and `OFFSET` support
-- But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu's SQLite is old and doesn't support `UPSERT`, so my indexing doesn't work…
+- Note the special `excluded.views` form! See SQLite’s lang_UPSERT documentation
+- Oh nice, I finally finished the Falcon API route to page through all the results using SQLite’s amazing `LIMIT` and `OFFSET` support
+- But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu’s SQLite is old and doesn’t support `UPSERT`, so my indexing doesn’t work…
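For reference, the paging itself is simple with the `items` table created above; a minimal sketch with a hypothetical page size of 100:

```
sqlite> -- hypothetical page size of 100; the API presumably maps its page parameter onto OFFSET
sqlite> SELECT id, views, downloads FROM items ORDER BY id LIMIT 100 OFFSET 0;
sqlite> SELECT id, views, downloads FROM items ORDER BY id LIMIT 100 OFFSET 100;
```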
- Apparently `UPSERT` came in SQLite 3.24.0 (2018-06-04), and Ubuntu 16.04 has 3.11.0
- Ok this is hilarious, I manually downloaded the libsqlite3 3.24.0 deb from Ubuntu 18.10 “cosmic” and installed it in Ubuntu 16.04 and now the Python `indexer.py` works
- This is definitely a dirty hack, but the list of packages we use that depend on `libsqlite3-0` in Ubuntu 16.04 is actually pretty short:
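The list itself is cut off by this hunk, but for what it's worth, one way to check the reverse dependencies on such a system is apt-cache, limited to installed packages:

```
$ apt-cache --installed rdepends libsqlite3-0   # not the original list from the notes
```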
@@ -543,28 +543,28 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
2018-09-25
- I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views
-- I'm not even sure how that's possible, as we only have 74,000 items!
+- I’m not even sure how that’s possible, as we only have 74,000 items!
- I need to inspect the `id` values that are returned for views and cross check them with the `owningItem` values for bitstream downloads…
-- Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr `id` field doesn't correspond with actual DSpace items?)
-- I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don't give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
-- CGSpace's Solr core has 150,000,000 documents in it… and it's still pretty fast to query, but it's really a maintenance and backup burden
-- DSpace Test currently has about 2,000,000 documents with `isBot:true` in its Solr statistics core, and the size on disk is 2GB (it's not much, but I have to test this somewhere!)
-- According to the DSpace 5.x Solr documentation I can use `dspace stats-util -f`, so let's try it:
+- Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr `id` field doesn’t correspond with actual DSpace items?)
+- I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don’t give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
+- CGSpace’s Solr core has 150,000,000 documents in it… and it’s still pretty fast to query, but it’s really a maintenance and backup burden
+- DSpace Test currently has about 2,000,000 documents with `isBot:true` in its Solr statistics core, and the size on disk is 2GB (it’s not much, but I have to test this somewhere!)
+- According to the DSpace 5.x Solr documentation I can use `dspace stats-util -f`, so let’s try it:
$ dspace stats-util -f
- The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with `isBot:true`
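A quick way to check that count is a zero-row query against the statistics core (this sketch assumes the same tunnel on localhost:3000 used in the queries above):

```
$ http 'http://localhost:3000/solr/statistics/select?q=isBot:true&rows=0&indent=on'   # numFound in the response is the count
```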
-- I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it's 201 instead of 2,000,000, and statistics core is only 30MB now!
+- I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it’s 201 instead of 2,000,000, and statistics core is only 30MB now!
- I will set the `logBots = false` property in `dspace/config/modules/usage-statistics.cfg` on DSpace Test and check if the number of `isBot:true` events goes up any more…
- I restarted the server with `logBots = false` and after it came back up I see 266 events with `isBot:true` (maybe they were buffered)… I will check again tomorrow
-- After a few hours I see there are still only 266 view events with `isBot:true` on DSpace Test's Solr statistics core, so I'm definitely going to deploy this on CGSpace soon
-- Also, CGSpace currently has 60,089,394 view events with `isBot:true` in its Solr statistics core and it is 124GB!
+- After a few hours I see there are still only 266 view events with `isBot:true` on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon
+- Also, CGSpace currently has 60,089,394 view events with `isBot:true` in its Solr statistics core and it is 124GB!
- Amazing! After running `dspace stats-util -f` on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with `isBot:true` so I should really disable logging of bot events!
-- I'm super curious to see how the JVM heap usage changes…
+- I’m super curious to see how the JVM heap usage changes…
- I made (and merged) a pull request to disable bot logging on the `5_x-prod` branch (#387)
-- Now I'm wondering if there are other bot requests that aren't classified as bots because the IP lists or user agents are outdated
+- Now I’m wondering if there are other bot requests that aren’t classified as bots because the IP lists or user agents are outdated
- DSpace ships a list of spider IPs, for example: `config/spiders/iplists.com-google.txt`
-- I checked the list against all the IPs we've seen using the “Googlebot” useragent on CGSpace's nginx access logs
+- I checked the list against all the IPs we’ve seen using the “Googlebot” useragent on CGSpace’s nginx access logs
- The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be “Googlebot”…
- According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either `googlebot.com` or `google.com`
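For example, one of the 66.249.64.x addresses seen in the logs can be verified with a reverse lookup and then a forward lookup on the returned name (the output below is illustrative, not captured from the server):

```
$ host 66.249.64.91
91.64.249.66.in-addr.arpa domain name pointer crawl-66-249-64-91.googlebot.com.
$ host crawl-66-249-64-91.googlebot.com
crawl-66-249-64-91.googlebot.com has address 66.249.64.91
```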
- In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):
@@ -577,7 +577,7 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
- And magically all those 81,000 documents are gone!
- After a few hours the Solr statistics core is down to 44GB on CGSpace!
-- I did a major refactor and logic fix in the DSpace Statistics API's `indexer.py`
+- I did a major refactor and logic fix in the DSpace Statistics API’s `indexer.py`
- Basically, it turns out that using `facet.mincount=1` is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways
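As a hedged sketch (modelled on the download queries earlier in these notes, not necessarily the exact parameters `indexer.py` uses), a faceted query with `facet.mincount=1` only returns facet buckets that actually have hits:

```
$ http 'http://localhost:3000/solr/statistics/select?q=type:0&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view&rows=0&facet=true&facet.field=owningItem&facet.mincount=1&indent=on'   # parameters are illustrative
```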
- I deployed the new version on CGSpace and now it looks pretty good!
@@ -585,14 +585,14 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
...
Indexing item downloads (page 260 of 260)
-- And now it's fast as hell due to the muuuuch smaller Solr statistics core
+- And now it’s fast as hell due to the muuuuch smaller Solr statistics core
2018-09-26
- Linode emailed to say that CGSpace (linode18) was using 30Mb/sec of outward bandwidth for two hours around midnight
-- I don't see anything unusual in the nginx logs, so perhaps it was the cron job that syncs the Solr database to Amazon S3?
+- I don’t see anything unusual in the nginx logs, so perhaps it was the cron job that syncs the Solr database to Amazon S3?
- It could be that the bot purge yesterday changed the core significantly so there was a lot to change?
-- I don't see any drop in JVM heap size in CGSpace's munin stats since I did the Solr cleanup, but this looks pretty good:
+- I don’t see any drop in JVM heap size in CGSpace’s munin stats since I did the Solr cleanup, but this looks pretty good:

@@ -610,16 +610,16 @@ real 77m3.755s
user 7m39.785s
sys 2m18.485s
-- I told Peter it's better to do the access rights before the usage rights because the git branches are conflicting with each other and it's actually a pain in the ass to keep changing the values as we discuss, rebase, merge, fix conflicts…
+- I told Peter it’s better to do the access rights before the usage rights because the git branches are conflicting with each other and it’s actually a pain in the ass to keep changing the values as we discuss, rebase, merge, fix conflicts…
- Udana and Mia from WLE were asking some questions about their WLE Feedburner feed
-- It's pretty confusing, because until recently they were entering issue dates as only YYYY (like 2018) and their feeds were all showing items in the wrong order
-- I'm not exactly sure what their problem now is, though (confusing)
-- I updated the dspace-statistics-api to use psycopg2's `execute_values()` to insert batches of 100 values into PostgreSQL instead of doing every insert individually
+- It’s pretty confusing, because until recently they were entering issue dates as only YYYY (like 2018) and their feeds were all showing items in the wrong order
+- I’m not exactly sure what their problem now is, though (confusing)
+- I updated the dspace-statistics-api to use psycopg2’s `execute_values()` to insert batches of 100 values into PostgreSQL instead of doing every insert individually
- On CGSpace this reduces the total run time of `indexer.py` from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)
2018-09-27
-- Linode emailed to say that CGSpace's (linode19) CPU load was high for a few hours last night
+- Linode emailed to say that CGSpace’s (linode19) CPU load was high for a few hours last night
- Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
@@ -643,7 +643,7 @@ sys 2m18.485s
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
758
-- I will add their IPs to the list of bad bots in nginx so we can add a “bot” user agent to them and let Tomcat's Crawler Session Manager Valve handle them
+- I will add their IPs to the list of bad bots in nginx so we can add a “bot” user agent to them and let Tomcat’s Crawler Session Manager Valve handle them
- I asked Atmire to prepare an invoice for 125 credits
2018-09-29
@@ -670,7 +670,7 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
-- Then I can simply delete the “Other” and “other” ones because that's not useful at all:
+- Then I can simply delete the “Other” and “other” ones because that’s not useful at all:
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
DELETE 6
diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html
index 194f7707d..cd29ea479 100644
--- a/docs/2018-10/index.html
+++ b/docs/2018-10/index.html
@@ -9,7 +9,7 @@
@@ -21,9 +21,9 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
@@ -53,7 +53,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
@@ -100,7 +100,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
October, 2018
@@ -108,7 +108,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
-- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
+- I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
2018-10-03
@@ -133,7 +133,7 @@ I created a GitHub issue to track this #389, because I'm super busy in Nairo
118927 200
31435 500
-- I added Phil Thornton and Sonal Henson's ORCID identifiers to the controlled vocabulary for `cg.creator.orcid` and then re-generated the names using my resolve-orcids.py script:
+- I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for `cg.creator.orcid` and then re-generated the names using my resolve-orcids.py script:
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
@@ -160,14 +160,14 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
87646 34.218.226.147
111729 213.139.53.62
-- But in super positive news, he says they are using my new dspace-statistics-api and it's MUCH faster than using Atmire CUA's internal “restlet” API
-- I don't recognize the `138.201.49.199` IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
+- But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API
+- I don’t recognize the `138.201.49.199` IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
-- Suspiciously, it's only grabbing the CGIAR System Office community (handle prefix 10947):
+- Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
7 GET /handle/10568
@@ -177,9 +177,9 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
-- It's clearly a bot and it's not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
-- I looked in Solr's statistics core and these hits were actually all counted as `isBot:false` (of course)… hmmm
-- I tagged all of Sonal and Phil's items with their ORCID identifiers on CGSpace using my add-orcid-identifiers.py script:
+- It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
+- I looked in Solr’s statistics core and these hits were actually all counted as `isBot:false` (of course)… hmmm
+- I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers.py script:
$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
@@ -205,7 +205,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
- I see there are other bundles we might need to pay attention to: `TEXT`, `@_LOGO-COLLECTION_@`, `@_LOGO-COMMUNITY_@`, etc…
- On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads
-- So it's fixed, but I'm not sure why!
+- So it’s fixed, but I’m not sure why!
- Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
# zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
@@ -216,7 +216,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
2018-10-05
-- Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay's work plan
+- Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay’s work plan
- We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms
- Add a link to AReS explorer to the CGSpace homepage introduction text
@@ -224,30 +224,30 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
- Follow up with AgriKnowledge about including Handle links (`dc.identifier.uri`) on their item pages
- In July, 2018 they had said their programmers would include the field in the next update of their website software
-- CIMMYT's DSpace repository is now running DSpace 5.x!
-- It's running OAI, but not REST, so I need to talk to Richard about that!
+- CIMMYT’s DSpace repository is now running DSpace 5.x!
+- It’s running OAI, but not REST, so I need to talk to Richard about that!
2018-10-08
-- AgriKnowledge says they're going to add the `dc.identifier.uri` to their item view in November when they update their website software
+- AgriKnowledge says they’re going to add the `dc.identifier.uri` to their item view in November when they update their website software
2018-10-10
-- Peter noticed that some recently added PDFs don't have thumbnails
-- When I tried to force them to be generated I got an error that I've never seen before:
+- Peter noticed that some recently added PDFs don’t have thumbnails
+- When I tried to force them to be generated I got an error that I’ve never seen before:
$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
-- I see there was an update to Ubuntu's ImageMagick on 2018-10-05, so maybe something changed or broke?
-- I get the same error when forcing `filter-media` to run on DSpace Test too, so it's gotta be an ImageMagick bug
+- I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?
+- I get the same error when forcing `filter-media` to run on DSpace Test too, so it’s gotta be an ImageMagick bug
- The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an Ubuntu Security Notice from 2018-10-04
- Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)
- I commented out the line that disables PDF thumbnails in `/etc/ImageMagick-6/policy.xml`:
<!--<policy domain="coder" rights="none" pattern="PDF" />-->
-- This works, but I'm not sure what ImageMagick's long-term plan is if they are going to disable ALL image formats…
+- This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…
- I suppose I need to enable a workaround for this in Ansible?
2018-10-11
@@ -292,7 +292,7 @@ COPY 10000
2018-10-13
- Run all system updates on DSpace Test (linode19) and reboot it
-- Look through Peter's list of 746 author corrections in OpenRefine
+- Look through Peter’s list of 746 author corrections in OpenRefine
- I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:
or(
@@ -307,13 +307,13 @@ COPY 10000
$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
-- I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay's author controlled vocabulary
+- I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary
2018-10-14
- Merge the authors controlled vocabulary (#393), usage rights (#394), and the upstream DSpace 5.x cherry-picks (#394) into our `5_x-prod` branch
-- Switch to new CGIAR LDAP server on CGSpace, as it's been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
-- Apply Peter's 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
+- Switch to new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
+- Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
@@ -322,21 +322,21 @@ COPY 10000
- Restarting the service with systemd works for a few seconds, then the java process quits
- I suspect that the systemd service type needs to be `forking` rather than `simple`, because the service calls the default DSpace `start-handle-server` shell script, which uses `nohup` and `&` to background the java process
- It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting
-- Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page's header, rather than the HTML properties they are using in their body
+- Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page’s header, rather than the HTML properties they are using in their body
- Peter pointed out that some thumbnails were still not getting generated
- When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week… WTF?
- Looks like I can use `/usr/share/ghostscript/current` instead of `/usr/share/ghostscript/9.25`…
-- I limited the tall thumbnails even further to 170px because Peter said CTA's were still too tall at 200px (#396)
+- I limited the tall thumbnails even further to 170px because Peter said CTA’s were still too tall at 200px (#396)
2018-10-15
- Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications
-- I don't see anything in the Catalina logs or `dmesg`, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”
+- I don’t see anything in the Catalina logs or `dmesg`, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”
- Actually, now I remember that yesterday when I deployed the latest changes from git on DSpace Test I noticed a syntax error in one XML file when I was doing the discovery reindexing
-- I fixed it so that I could reindex, but I guess the rest of DSpace actually didn't start up…
+- I fixed it so that I could reindex, but I guess the rest of DSpace actually didn’t start up…
- Create an account on DSpace Test for Felix from Earlham so he can test COPO submission
- I created a new collection and added him as the administrator so he can test submission
@@ -360,8 +360,8 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
-- Talking to the CodeObia guys about the REST API I started to wonder why it's so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
-- Interestingly, the speed doesn't get better after you request the same thing multiple times–it's consistently bad on both CGSpace and DSpace Test!
+- Talking to the CodeObia guys about the REST API I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
+- Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!
$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
@@ -441,13 +441,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
-- So I need to handle that situation in the script for sure, but I'm not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?
+- So I need to handle that situation in the script for sure, but I’m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?
- I made a pull request and merged the ORCID updates into the `5_x-prod` branch (#397)
- Improve the logic of name checking in my resolve-orcids.py script
2018-10-18
-- I granted MEL's deposit user admin access to IITA, CIP, Bioversity, and RTB communities on DSpace Test so they can start testing real depositing
+- I granted MEL’s deposit user admin access to IITA, CIP, Bioversity, and RTB communities on DSpace Test so they can start testing real depositing
- After they do some tests and we check the values Enrico will send a formal email to Peter et al to ask that they start depositing officially
- I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually migrate from 9.5 to 9.6:
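The actual commands are cut off by this hunk, but the usual postgresql-common flow on Ubuntu looks roughly like this (a sketch, not a record of what was run):

```
# a sketch of the usual postgresql-common upgrade flow (not the actual commands run here)
pg_dropcluster --stop 9.6 main    # drop the empty 9.6 cluster the package created
pg_upgradecluster 9.5 main        # migrate the 9.5 cluster to a new 9.6 cluster
pg_dropcluster 9.5 main           # remove the old 9.5 cluster afterwards
```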
@@ -474,12 +474,12 @@ $ exit
1629 66.249.64.91
1758 5.9.6.51
-- 5.9.6.51 is MegaIndex, which I've seen before…
+- 5.9.6.51 is MegaIndex, which I’ve seen before…
2018-10-20
-- I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace's Solr configuration is for 4.9
-- This means our existing Solr configuration doesn't run in Solr 5.5:
+- I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace’s Solr configuration is for 4.9
+- This means our existing Solr configuration doesn’t run in Solr 5.5:
$ sudo docker pull solr:5
$ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
@@ -488,7 +488,7 @@ $ sudo docker logs my_solr
ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
- Apparently a bunch of variable types were removed in Solr 5
-- So for now it's actually a huge pain in the ass to run the tests for my dspace-statistics-api
+- So for now it’s actually a huge pain in the ass to run the tests for my dspace-statistics-api
- Linode sent a message that the CPU usage was high on CGSpace (linode18) last night
- According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:
@@ -517,11 +517,11 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
8915
-- Last month I added “crawl” to the Tomcat Crawler Session Manager Valve's regular expression matching, and it seems to be working for MegaIndex's user agent:
+- Last month I added “crawl” to the Tomcat Crawler Session Manager Valve’s regular expression matching, and it seems to be working for MegaIndex’s user agent:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
-- So I'm not sure why this bot uses so many sessions — is it because it requests very slowly?
+- So I’m not sure why this bot uses so many sessions — is it because it requests very slowly?
2018-10-21
@@ -552,7 +552,7 @@ UPDATE 76608
- Improve the usage rights (`dc.rights`) on CGSpace again by adding the long names in the submission form, as well as adding version 3.0 and Creative Commons Zero (CC0) public domain license (#399)
- Add “usage rights” to the XMLUI item display (#400)
- I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace
-- Testing REST login and logout via httpie because Felix from Earlham says he's having issues:
+- Testing REST login and logout via httpie because Felix from Earlham says he’s having issues:
$ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
acef8a4a-41f3-4392-b870-e873790f696b
@@ -576,8 +576,8 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
- I deployed the new Creative Commons choices to the usage rights on the CGSpace submission form
- Also, I deployed the changes to show usage rights on the item view
-- Re-work the dspace-statistics-api to use Python's native json instead of ujson to make it easier to deploy in places where we don't have — or don't want to have — Python headers and a compiler (like containers)
-- Re-work the deployment of the API to use systemd's `EnvironmentFile` to read the environment variables instead of `Environment` in the RMG Ansible infrastructure scripts
+- Re-work the dspace-statistics-api to use Python’s native json instead of ujson to make it easier to deploy in places where we don’t have — or don’t want to have — Python headers and a compiler (like containers)
+- Re-work the deployment of the API to use systemd’s `EnvironmentFile` to read the environment variables instead of `Environment` in the RMG Ansible infrastructure scripts
2018-10-25
@@ -602,7 +602,7 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
- Then I re-generated the `requirements.txt` in the dspace-statistics-library and released version 0.5.2
- Then I re-deployed the API on DSpace Test, ran all system updates on the server, and rebooted it
- I tested my hack of depositing to one collection where the default item and bitstream READ policies are restricted and then mapping the item to another collection, but the item retains its default policies so Anonymous cannot see them in the mapped collection either
-- Perhaps we need to try moving the item and inheriting the target collection's policies?
+- Perhaps we need to try moving the item and inheriting the target collection’s policies?
- I merged the changes for adding publisher (`dc.publisher`) to the advanced search to the `5_x-prod` branch (#402)
- I merged the changes for adding versionless Creative Commons licenses to the submission form to the `5_x-prod` branch (#403)
- I will deploy them later this week
@@ -617,7 +617,7 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: app
- Meet with the COPO guys to walk them through the CGSpace submission workflow and discuss CG core, REST API, etc
- I suggested that they look into submitting via the SWORDv2 protocol because it respects the workflows
-- They said that they're not too worried about the hierarchical CG core schema, that they would just flatten metadata like affiliations when depositing to a DSpace repository
+- They said that they’re not too worried about the hierarchical CG core schema, that they would just flatten metadata like affiliations when depositing to a DSpace repository
- I said that it might be time to engage the DSpace community to add support for more advanced schemas in DSpace 7+ (perhaps partnership with Atmire?)
diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html
index db2199bd2..7918a46e4 100644
--- a/docs/2018-11/index.html
+++ b/docs/2018-11/index.html
@@ -33,7 +33,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
@@ -63,7 +63,7 @@ Today these are the top 10 IPs:
@@ -110,7 +110,7 @@ Today these are the top 10 IPs:
November, 2018
@@ -138,7 +138,7 @@ Today these are the top 10 IPs:
22508 66.249.64.59
- The `66.249.64.x` are definitely Google
-`70.32.83.92` is well known, probably CCAFS or something, as it's only a few thousand requests and always to REST API
+`70.32.83.92` is well known, probably CCAFS or something, as it’s only a few thousand requests and always to REST API
`84.38.130.177` is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
@@ -154,13 +154,13 @@ Today these are the top 10 IPs:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
-- And it doesn't seem they are re-using their Tomcat sessions:
+- And it doesn’t seem they are re-using their Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
1243
-- Ah, we've apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
-- I wonder if it's worth adding them to the list of bots in the nginx config?
+- Ah, we’ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
+- I wonder if it’s worth adding them to the list of bots in the nginx config?
- Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth
- Looking at the nginx logs again I see the following top ten IPs:
@@ -176,11 +176,11 @@ Today these are the top 10 IPs:
12557 78.46.89.18
32152 66.249.64.59
-`78.46.89.18` is new since I last checked a few hours ago, and it's from Hetzner with the following user agent:
+`78.46.89.18` is new since I last checked a few hours ago, and it’s from Hetzner with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
-- It's making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
+- It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
8449
@@ -190,7 +190,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions
I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing
Perhaps I should think about adding rate limits to dynamic pages like `/discover` and `/browse`
-I think it's reasonable for a human to click one of those links five or ten times a minute…
+I think it’s reasonable for a human to click one of those links five or ten times a minute…
To contrast, `78.46.89.18` made about 300 requests per minute for a few hours today:
# grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
@@ -221,7 +221,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
2018-11-04
-- Forward Peter's information about CGSpace financials to Modi from ICRISAT
+- Forward Peter’s information about CGSpace financials to Modi from ICRISAT
- Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again
- Here are the top ten IPs active so far this morning:
@@ -355,7 +355,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
2018-11-08
- I deployed version 0.7.0 of the dspace-statistics-api on DSpace Test (linode19) so I can test it for a few days (and check the Munin stats to see the change in database connections) before deploying on CGSpace
-- I also enabled systemd's persistent journal by setting `Storage=persistent` in `journald.conf`
+- I also enabled systemd’s persistent journal by setting `Storage=persistent` in `journald.conf`
- Apparently Ubuntu 16.04 defaulted to using rsyslog for boot records until early 2018, so I removed `rsyslog` too
- Proof 277 IITA records on DSpace Test: IITA_ ALIZZY1802-csv_oct23
@@ -371,7 +371,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
2018-11-13
- Help troubleshoot an issue with Judy Kimani submitting to the ILRI project reports, papers and documents collection on CGSpace
-- For some reason there is an existing group for the “Accept/Reject” workflow step, but it's empty
+- For some reason there is an existing group for the “Accept/Reject” workflow step, but it’s empty
- I added Judy to the group and told her to try again
- Sisay changed his leave to be full days until December so I need to finish the IITA records that he was working on (IITA_ ALIZZY1802-csv_oct23)
- Sisay had said there were a few PDFs missing and Bosede sent them this week, so I had to find those items on DSpace Test and add the bitstreams to the items manually
@@ -381,7 +381,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
2018-11-14
- Finally import the 277 IITA (ALIZZY1802) records to CGSpace
-- I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace's `dspace export` command doesn't include the collections for the items!
+- I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace’s `dspace export` command doesn’t include the collections for the items!
- Delete all old IITA collections on DSpace Test and run `dspace cleanup` to get rid of all the bitstreams
2018-11-15
@@ -428,12 +428,12 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2018-11-19 15:23:04,223 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
-- I looked in the Solr log around that time and I don't see anything…
-- Working on Udana's WLE records from last month, first the sixteen records in 2018-11-20 RDL Temp
+- I looked in the Solr log around that time and I don’t see anything…
+- Working on Udana’s WLE records from last month, first the sixteen records in 2018-11-20 RDL Temp
- these items will go to the Restoring Degraded Landscapes collection
- a few items missing DOIs, but they are easily available on the publication page
-- clean up DOIs to use “https://doi.org" format
+- clean up DOIs to use “https://doi.org” format
- clean up some cg.identifier.url to remove unnecessary query strings
- remove columns with no metadata (river basin, place, target audience, isbn, uri, publisher, ispartofseries, subject)
- fix column with invalid spaces in metadata field name (cg. subject. wle)
@@ -447,12 +447,12 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
- these items will go to the Variability, Risks and Competing Uses collection
- trim and collapse whitespace in all fields (lots in WLE subject!)
- clean up some cg.identifier.url fields that had unnecessary anchors in their links
-- clean up DOIs to use “https://doi.org" format
+- clean up DOIs to use “https://doi.org” format
- fix column with invalid spaces in metadata field name (cg. subject. wle)
- remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)
- remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: `value.replace('�','')`
-- I notice a few items using DOIs pointing at ICARDA's DSpace like: https://doi.org/20.500.11766/8178, which then points at the “real” DOI on the publisher's site… these should be using the real DOI instead of ICARDA's “fake” Handle DOI
-- Some items missing DOIs, but they clearly have them if you look at the publisher's site
+- I notice a few items using DOIs pointing at ICARDA’s DSpace like: https://doi.org/20.500.11766/8178, which then points at the “real” DOI on the publisher’s site… these should be using the real DOI instead of ICARDA’s “fake” Handle DOI
+- Some items missing DOIs, but they clearly have them if you look at the publisher’s site
@@ -463,7 +463,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
Judy Kimani was having issues resuming submissions in another ILRI collection recently, and the issue there was due to an empty group defined for the “accept/reject” step (aka workflow step 1)
The error then was “authorization denied for workflow step 1” where “workflow step 1” was the “accept/reject” step, which had a group defined, but was empty
Adding her to this group solved her issues
-Tezira says she's also getting the same “authorization denied” error for workflow step 1 when resuming submissions, so I told Abenet to delete the empty group
+Tezira says she’s also getting the same “authorization denied” error for workflow step 1 when resuming submissions, so I told Abenet to delete the empty group
@@ -475,7 +475,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
$ dspace index-discovery -r 10568/41888
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
-- … but the item still doesn't appear in the collection
+- … but the item still doesn’t appear in the collection
- Now I will try a full Discovery re-index:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
@@ -503,7 +503,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
4564 70.32.83.92
- We know 70.32.83.92 is the CCAFS harvester on MediaTemple, but 205.186.128.185 appears to be a new CCAFS harvester
-- I think we might want to prune some old accounts from CGSpace, perhaps users who haven't logged in in the last two years would be a conservative bunch:
+- I think we might want to prune some old accounts from CGSpace, perhaps users who haven’t logged in in the last two years would be a conservative bunch:
$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
409
@@ -514,15 +514,15 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
- The workflow step 1 (accept/reject) is now undefined for some reason
- Last week the group was defined, but empty, so we added her to the group and she was able to take the tasks
-- Since then it looks like the group was deleted, so now she didn't have permission to take or leave the tasks in her pool
-- We added her back to the group, then she was able to take the tasks, and then we removed the group again, as we generally don't use this step in CGSpace
+- Since then it looks like the group was deleted, so now she didn’t have permission to take or leave the tasks in her pool
+- We added her back to the group, then she was able to take the tasks, and then we removed the group again, as we generally don’t use this step in CGSpace
Help Marianne troubleshoot some issue with items in their WLE collections and the WLE publications website
2018-11-28
-- Change the usage rights text a bit based on Maria Garruccio's feedback on “all rights reserved” (#404)
+- Change the usage rights text a bit based on Maria Garruccio’s feedback on “all rights reserved” (#404)
- Run all system updates on DSpace Test (linode19) and reboot the server
diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html
index 032c9d61b..77d4e8f38 100644
--- a/docs/2018-12/index.html
+++ b/docs/2018-12/index.html
@@ -33,7 +33,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
@@ -63,7 +63,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
@@ -110,7 +110,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
December, 2018
@@ -148,7 +148,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
- A comment on the StackOverflow question from yesterday suggests it might be a bug with the `pngalpha` device in Ghostscript and links to an upstream bug
- I think we need to wait for a fix from Ubuntu
-- For what it's worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
+- For what it’s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
@@ -167,7 +167,7 @@ DEBUG: FC_WEIGHT didn't match
One item had “MADAGASCAR” for ISI Journal
Minor corrections in IITA subject (LIVELIHOOD→LIVELIHOODS)
Trim whitespace in abstract field
-Fix some sponsors (though some with “Governments of Canada” etc I'm not sure why those are plural)
+Fix some sponsors (though some with “Governments of Canada” etc I’m not sure why those are plural)
Eighteen items had `en||fr` for the language, but the content was only in French so I changed them to just `fr`
Six items had encoding errors in French text so I will ask Bosede to re-do them carefully
Correct and normalize a few AGROVOC subjects
@@ -198,18 +198,18 @@ DEBUG: FC_WEIGHT didn't match
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
-- And wow, I can't even run ImageMagick's `identify` on the first page of the second item (10568/98930):
+- And wow, I can’t even run ImageMagick’s `identify` on the first page of the second item (10568/98930):
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
-- But with GraphicsMagick's `identify` it works:
+- But with GraphicsMagick’s `identify` it works:
$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn't match
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
-- Interesting that ImageMagick's `identify` does work if you do not specify a page, perhaps as alluded to in the recent Ghostscript bug report:
+- Interesting that ImageMagick’s `identify` does work if you do not specify a page, perhaps as alluded to in the recent Ghostscript bug report:
$ identify Food\ safety\ Kenya\ fruits.pdf
Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
@@ -226,7 +226,7 @@ zsh: abort (core dumped) convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnai
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
DEBUG: FC_WEIGHT didn't match
-- I inspected the troublesome PDF using jhove and noticed that it is using `ISO PDF/A-1, Level B` and the other one doesn't list a profile, though I don't think this is relevant
+- I inspected the troublesome PDF using jhove and noticed that it is using `ISO PDF/A-1, Level B` and the other one doesn’t list a profile, though I don’t think this is relevant
- I found another item that fails when generating a thumbnail (10568/98391); DSpace complains:
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
@@ -256,16 +256,16 @@ Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `
at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
... 15 more
-- And on my Arch Linux environment ImageMagick's `convert` also segfaults:
+- And on my Arch Linux environment ImageMagick’s `convert` also segfaults:
$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
zsh: abort (core dumped) convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] x60
-- But GraphicsMagick's `convert` works:
+- But GraphicsMagick’s `convert` works:
$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
-- So far the only thing that stands out is that the two files that don't work were created with Microsoft Office 2016:
+- So far the only thing that stands out is that the two files that don’t work were created with Microsoft Office 2016:
$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
Creator: Microsoft® Word 2016
@@ -285,14 +285,14 @@ Producer: Microsoft® Word for Office 365
$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
$ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
-- I've tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
+- I’ve tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
- In the end I tried one last time to just apply without registering and it was apparently successful
- Testing DSpace 5.8 (`5_x-prod` branch) in an Ubuntu 18.04 VM with Tomcat 8.5 and had some issues:
- JSPUI shows an internal error (log shows something about tag cloud, though, so might be unrelated)
-- Atmire Listings and Reports, which use JSPUI, asks you to log in again and then doesn't work
-- Content and Usage Analysis doesn't show up in the sidebar after logging in
-- I can navigate to /atmire/reporting-suite/usage-graph-editor, but it's only the Atmire theme and a “page not found” message
+- Atmire Listings and Reports, which use JSPUI, asks you to log in again and then doesn’t work
+- Content and Usage Analysis doesn’t show up in the sidebar after logging in
+- I can navigate to /atmire/reporting-suite/usage-graph-editor, but it’s only the Atmire theme and a “page not found” message
- Related messages from dspace.log:
@@ -311,7 +311,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
2018-12-04
-- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here's a list of the top users at the time and throughout the day:
+- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
225 40.77.167.142
@@ -336,14 +336,14 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
3210 2a01:4f8:140:3192::2
4190 35.237.175.180
-`35.237.175.180` is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working:
+`35.237.175.180` is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
630
-- I haven't seen `2a01:4f8:140:3192::2` before. Its user agent is some new bot:
+- I haven’t seen `2a01:4f8:140:3192::2` before. Its user agent is some new bot:
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
@@ -366,7 +366,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
1
-- In other news, it's good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):
+- In other news, it’s good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):

2018-12-05
@@ -376,7 +376,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
2018-12-06
- Linode sent a message that the CPU usage of CGSpace (linode18) is too high last night
-- I looked in the logs and there's nothing particular going on:
+- I looked in the logs and there’s nothing particular going on:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1225 157.55.39.177
@@ -402,8 +402,8 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
1156
`2a01:7e00::f03c:91ff:fe0a:d645` appears to be the CKM dev server where Danny is testing harvesting via Drupal
-- It seems they are hitting the XMLUI's OpenSearch a bit, but mostly on the REST API so no issues here yet
-`Drupal` is already in the Tomcat Crawler Session Manager Valve's regex so that's good!
+- It seems they are hitting the XMLUI’s OpenSearch a bit, but mostly on the REST API so no issues here yet
+`Drupal` is already in the Tomcat Crawler Session Manager Valve’s regex so that’s good!
2018-12-10
@@ -414,7 +414,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
- It sounds kinda crazy, but she said when she talked to Altmetric about their Twitter harvesting they said their coverage is not perfect, so it might be some kinda prioritization thing where they only do it for popular items?
- I am testing this by tweeting one WLE item from CGSpace that currently has no Altmetric score
- Interestingly, after about an hour I see it has already been picked up by Altmetric and has my tweet as well as some other tweet from over a month ago…
-- I tweeted a link to the item's DOI to see if Altmetric will notice it, hopefully associated with the Handle I tweeted earlier
+- I tweeted a link to the item’s DOI to see if Altmetric will notice it, hopefully associated with the Handle I tweeted earlier
@@ -429,9 +429,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
2018-12-13
-- Oh this is very interesting: WorldFish's repository is live now
-- It's running DSpace 5.9-SNAPSHOT running on KnowledgeArc and the OAI and REST interfaces are active at least
-- Also, I notice they ended up registering a Handle (they had been considering taking KnowledgeArc's advice to not use Handles!)
+- Oh this is very interesting: WorldFish’s repository is live now
+- It’s running DSpace 5.9-SNAPSHOT running on KnowledgeArc and the OAI and REST interfaces are active at least
+- Also, I notice they ended up registering a Handle (they had been considering taking KnowledgeArc’s advice to not use Handles!)
- Did some coordination work on the hotel bookings for the January AReS workshop in Amman
2018-12-17
@@ -479,7 +479,7 @@ $ ls -lh cgspace_2018-12-19.backup*
-rw-r--r-- 1 aorth aorth 94M Dec 20 11:36 cgspace_2018-12-19.backup.gz
-rw-r--r-- 1 aorth aorth 93M Dec 20 11:35 cgspace_2018-12-19.backup.xz
-- Looks like it's really not worth it…
+- Looks like it’s really not worth it…
- Peter pointed out that Discovery filters for CTA subjects on item pages were not working
- It looks like there were some mismatches in the Discovery index names and the XMLUI configuration, so I fixed them (#406)
- Peter asked if we could create a controlled vocabulary for publishers (dc.publisher)
@@ -491,7 +491,7 @@ $ ls -lh cgspace_2018-12-19.backup*
3522
(1 row)
-- I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we're not pushing forward with the new status terms for now
+- I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we’re not pushing forward with the new status terms for now
- Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:
# dpkg -P oracle-java8-installer oracle-java8-set-default
@@ -514,7 +514,7 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
# pg_dropcluster 9.5 main
# dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
-- I've been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
+- I’ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
- Run all system updates on CGSpace (linode18) and restart the server
- Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:
@@ -564,7 +564,7 @@ UPDATE 1
1253 54.70.40.11
- All these look ok (54.70.40.11 is known to us from earlier this month and should be reusing its Tomcat sessions)
-- So I'm not sure what was going on last night…
+- So I’m not sure what was going on last night…
diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html
index 693576725..3f135b7b5 100644
--- a/docs/2019-01/index.html
+++ b/docs/2019-01/index.html
@@ -9,7 +9,7 @@
@@ -77,7 +77,7 @@ I don't see anything interesting in the web server logs around that time tho
@@ -124,7 +124,7 @@ I don't see anything interesting in the web server logs around that time tho
January, 2019
@@ -132,7 +132,7 @@ I don't see anything interesting in the web server logs around that time tho
2019-01-02
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
-- I don't see anything interesting in the web server logs around that time though:
+- I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -158,7 +158,7 @@ I don't see anything interesting in the web server logs around that time tho
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
261 handle
-- It's not clear to me what was causing the outbound traffic spike
+- It’s not clear to me what was causing the outbound traffic spike
- Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):
Moving: 81742 into core statistics-2010
@@ -182,7 +182,7 @@ Moving: 18497180 into core statistics-2018
$ sudo docker rm dspacedb
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
-- Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire's Listings and Reports still doesn't work
+- Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire’s Listings and Reports still doesn’t work
- After logging in via XMLUI and clicking the Listings and Reports link from the sidebar it redirects me to a JSPUI login page
- If I log in again there the Listings and Reports work… hmm.
@@ -264,17 +264,17 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
-- I notice that I get different JSESSIONID cookies for / (XMLUI) and /jspui (JSPUI) on Tomcat 8.5.37, I wonder if it's the same on Tomcat 7.0.92… yes I do.
+- I notice that I get different JSESSIONID cookies for / (XMLUI) and /jspui (JSPUI) on Tomcat 8.5.37, I wonder if it’s the same on Tomcat 7.0.92… yes I do.
- Hmm, on Tomcat 7.0.92 I see that I get a dspace.current.user.id session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in…
-- I didn't see that cookie being set on Tomcat 8.5.37
+- I didn’t see that cookie being set on Tomcat 8.5.37
- I sent a message to the dspace-tech mailing list to ask
2019-01-04
-- Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don't see anything around that time in the web server logs:
+- Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
189 207.46.13.192
@@ -288,7 +288,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
1776 66.249.70.27
2099 54.70.40.11
-- I'm thinking about trying to validate our dc.subject terms against AGROVOC webservices
+- I’m thinking about trying to validate our dc.subject terms against AGROVOC webservices
- There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for SOIL:
$ http 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en'
@@ -336,7 +336,7 @@ X-Frame-Options: ALLOW-FROM http://aims.fao.org
}
- The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)
-- I'm a bit confused that there's no obvious return code or status when a term is not found, for example SOILS:
+- I’m a bit confused that there’s no obvious return code or status when a term is not found, for example SOILS:
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
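- Since the HTTP status is 200 either way, the only obvious signal seems to be whether the results array in the body is empty, so a check like this could flag unmatched terms (a sketch with curl and jq, not necessarily how agrovoc-lookup.py does it):

```
$ curl -s 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOILS&lang=en' | jq '.results | length'
$ curl -s 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en' | jq '.results | length'
```

- A count of 0 would mean the term was not matched, which is easier to test in a script than inspecting the whole response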
@@ -428,8 +428,8 @@ In [14]: for row in result.fetchone():
- Tim Donohue responded to my thread about the cookies on the dspace-tech mailing list
-- He suspects it's a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the Tomcat 8.5 migration guide
-- I tried to switch my XMLUI and JSPUI contexts to use the LegacyCookieProcessor, but it didn't seem to help
+- He suspects it’s a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the Tomcat 8.5 migration guide
+- I tried to switch my XMLUI and JSPUI contexts to use the LegacyCookieProcessor, but it didn’t seem to help
- I filed DS-4140 on the DSpace issue tracker
@@ -438,8 +438,8 @@ In [14]: for row in result.fetchone():
- Tezira wrote to say she has stopped receiving the DSpace Submission Approved and Archived emails from CGSpace as of January 2nd
-- I told her that I haven't done anything to disable it lately, but that I would check
-- Bizu also says she hasn't received them lately
+- I told her that I haven’t done anything to disable it lately, but that I would check
+- Bizu also says she hasn’t received them lately
@@ -452,12 +452,12 @@ In [14]: for row in result.fetchone():
- Day two of CGSpace AReS meeting in Amman
- Discuss possibly extending the dspace-statistics-api to make community and collection statistics available
-- Discuss new “final” CG Core document and some changes that we'll need to do on CGSpace and other repositories
+- Discuss new “final” CG Core document and some changes that we’ll need to do on CGSpace and other repositories
- We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use “cg” namespace for everything else
- Major changes are to move dc.contributor.author to dc.creator (which MELSpace and WorldFish are already using in their DSpace repositories)
-- I am testing the speed of the WorldFish DSpace repository's REST API and it's five to ten times faster than CGSpace as I tested in 2018-10:
+- I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in 2018-10:
$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
@@ -582,8 +582,8 @@ In [14]: for row in result.fetchone():
- Something happened to the Solr usage statistics on CGSpace
-- I looked on the server and the Solr cores are there (56GB!), and I don't see any obvious errors in dmesg or anything
-- I see that the server hasn't been rebooted in 26 days so I rebooted it
+- I looked on the server and the Solr cores are there (56GB!), and I don’t see any obvious errors in dmesg or anything
+- I see that the server hasn’t been rebooted in 26 days so I rebooted it
- After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!
@@ -712,7 +712,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
- Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months…
- For 2019-01 alone the Usage Stats are already around 1.2 million
-- I tried to look in the nginx logs to see how many raw requests there are so far this month and it's about 1.4 million:
+- I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
1442874
@@ -724,8 +724,8 @@ sys 0m2.396s
- Send reminder to Atmire about purchasing the MQM module
- Trying to decide the solid action points for CGSpace on the CG Core 2.0 metadata…
-- It's difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!
-- Also, there is not a good Dublin Core reference (or maybe I just don't understand?)
+- It’s difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!
+- Also, there is not a good Dublin Core reference (or maybe I just don’t understand?)
- Several authoritative documents on Dublin Core appear to be:
- Dublin Core Metadata Element Set, Version 1.1: Reference Description
@@ -762,7 +762,7 @@ sys 0m2.396s
2019-01-19
-There's no official set of Dublin Core qualifiers so I can't tell if things like dc.contributor.author that are used by DSpace are official
+There’s no official set of Dublin Core qualifiers so I can’t tell if things like dc.contributor.author that are used by DSpace are official
I found a great presentation from 2015 by the Digital Repository of Ireland that discusses using MARC Relator Terms with Dublin Core elements
@@ -777,12 +777,12 @@ sys 0m2.396s
2019-01-20
-- That's weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
+- That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
# w
04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35
-- I've definitely rebooted it several times in the past few months… according to journalctl -b it was a few weeks ago on 2019-01-02
+- I’ve definitely rebooted it several times in the past few months… according to journalctl -b it was a few weeks ago on 2019-01-02
- I re-ran the Ansible DSpace tag, ran all system updates, and rebooted the host
- After rebooting I notice that the Linode kernel went down from 4.19.8 to 4.18.16…
- Atmire sent a quote on our ticket about purchasing the Metadata Quality Module (MQM) for DSpace 5.8
@@ -793,7 +793,7 @@ sys 0m2.396s
2019-01-21
-- Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04's Tomcat 8.5
+- Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04’s Tomcat 8.5
- I could either run with a simple tomcat7.service like this:
[Unit]
@@ -808,7 +808,7 @@ Group=aorth
[Install]
WantedBy=multi-user.target
-- Or try to adapt a real systemd service like Arch Linux's:
+- Or try to adapt a real systemd service like Arch Linux’s:
[Unit]
Description=Tomcat 7 servlet container
@@ -847,7 +847,7 @@ ExecStop=/usr/bin/jsvc \
WantedBy=multi-user.target
- I see that jsvc and libcommons-daemon-java are both available on Ubuntu so that should be easy to port
-- We probably don't need Eclipse Java Bytecode Compiler (ecj)
+- We probably don’t need Eclipse Java Bytecode Compiler (ecj)
- I tested Tomcat 7.0.92 on Arch Linux using the tomcat7.service with jsvc and it works… nice!
- I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number
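- Managing the Tomcat version that way might look something like this (a sketch; the download URL and install prefix are assumptions, not the actual Ansible tasks):

```
$ wget https://archive.apache.org/dist/tomcat/tomcat-7/v7.0.92/bin/apache-tomcat-7.0.92.tar.gz
$ tar xf apache-tomcat-7.0.92.tar.gz -C /opt
$ ln -sfn /opt/apache-tomcat-7.0.92 /opt/tomcat7
```

- Upgrading then just means extracting the new version and flipping the symlink, so the systemd unit paths stay stable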
- I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
@@ -858,7 +858,7 @@ $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&a
<result name="response" numFound="241" start="0">
- I opened an issue on the GitHub issue tracker (#10)
-- I don't think the SolrClient library we are currently using supports these type of queries so we might have to just do raw queries with requests
+- I don’t think the SolrClient library we are currently using supports these type of queries so we might have to just do raw queries with requests
- The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):
import pysolr
@@ -899,8 +899,8 @@ $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/
I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a shards query string (see the sketch after the list below)
A few things I noticed:
-- Solr doesn't mind if you use an empty shards parameter
-- Solr doesn't mind if you have an extra comma at the end of the shards parameter
+- Solr doesn’t mind if you use an empty shards parameter
+- Solr doesn’t mind if you have an extra comma at the end of the shards parameter
- If you are searching multiple cores, you need to include the base core in the shards parameter as well
- For example, compare the following two queries, first including the base core and the shard in the shards parameter, and then only including the shard:
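- As a rough illustration of the proof of concept above (independent of the comparison queries that follow), the active statistics cores can be read from the cores STATUS endpoint and joined into a shards parameter (a sketch; the host, port, and core names are assumptions based on the queries in these notes):

```
$ SHARDS=$(curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' \
    | jq -r '.status | keys[] | select(startswith("statistics"))' \
    | sed 's|^|localhost:8081/solr/|' | paste -s -d, -)
$ curl -s "http://localhost:8081/solr/statistics/select?q=*:*&rows=0&shards=$SHARDS"
```

- Because an empty or trailing-comma shards value is tolerated (per the notes above), the string building does not need to be very defensive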
@@ -930,7 +930,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
915 35.237.175.180
- 35.237.175.180 is known to us
-- I don't think we've seen 196.191.127.37 before. Its user agent is:
+- I don’t think we’ve seen 196.191.127.37 before. Its user agent is:
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
@@ -957,7 +957,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
Very interesting discussion of methods for running Tomcat under systemd
-We can set the ulimit options that used to be in /etc/default/tomcat7 with systemd's LimitNOFILE and LimitAS (see the systemd.exec man page)
+We can set the ulimit options that used to be in /etc/default/tomcat7 with systemd’s LimitNOFILE and LimitAS (see the systemd.exec man page)
- Note that we need to use infinity instead of unlimited for the address space
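- A quick way to confirm what actually got applied to the running Tomcat is to check the kernel's view of the process (a sketch; assumes a single Tomcat JVM that pgrep can match):

```
# grep -E 'Max (address space|open files)' /proc/$(pgrep -f org.apache.catalina.startup.Bootstrap)/limits
```

- The address space line should show unlimited if the LimitAS=infinity setting was picked up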
@@ -991,7 +991,7 @@ COPY 1109
9265 45.5.186.2
-I think it's the usual IPs:
+I think it’s the usual IPs:
- 45.5.186.2 is CIAT
- 70.32.83.92 is CCAFS
@@ -1009,7 +1009,7 @@ COPY 1109
-Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace's filter-media:
+Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s filter-media:
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
@@ -1022,9 +1022,9 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
2019-01-24
-- I noticed Ubuntu's Ghostscript 9.26 works on some troublesome PDFs where Arch's Ghostscript 9.26 doesn't, so the fix for the first/last page crash is not the patch I found yesterday
-- Ubuntu's Ghostscript uses another patch from Ghostscript git (upstream bug report)
-- I re-compiled Arch's ghostscript with the patch and then I was able to generate a thumbnail from one of the troublesome PDFs
+- I noticed Ubuntu’s Ghostscript 9.26 works on some troublesome PDFs where Arch’s Ghostscript 9.26 doesn’t, so the fix for the first/last page crash is not the patch I found yesterday
+- Ubuntu’s Ghostscript uses another patch from Ghostscript git (upstream bug report)
+- I re-compiled Arch’s ghostscript with the patch and then I was able to generate a thumbnail from one of the troublesome PDFs
- Before and after:
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
@@ -1068,7 +1068,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
-- CIAT's community currently has 12,000 items in it so this is normal
+- CIAT’s community currently has 12,000 items in it so this is normal
- The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again…
- For example: https://goo.gl/fb/VRj9Gq
- The full list of MARC Relators on the Library of Congress website linked from the DMCI relators page is very confusing
@@ -1085,9 +1085,9 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
- I tested by doing a Tomcat 7.0.91 installation, then switching it to 7.0.92 and it worked… nice!
- I refined the tasks so much that I was confident enough to deploy them on DSpace Test and it went very well
-- Basically I just stopped tomcat7, created a dspace user, removed tomcat7, chown'd everything to the dspace user, then ran the playbook
+- Basically I just stopped tomcat7, created a dspace user, removed tomcat7, chown’d everything to the dspace user, then ran the playbook
- So now DSpace Test (linode19) is running Tomcat 7.0.92… w00t
-- Now we need to monitor it for a few weeks to see if there is anything we missed, and then I can change CGSpace (linode18) as well, and we're ready for Ubuntu 18.04 too!
+- Now we need to monitor it for a few weeks to see if there is anything we missed, and then I can change CGSpace (linode18) as well, and we’re ready for Ubuntu 18.04 too!
@@ -1107,7 +1107,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
4644 205.186.128.185
4644 70.32.83.92
-- I think it's the usual IPs:
+- I think it’s the usual IPs:
- 70.32.83.92 is CCAFS
- 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)
@@ -1158,7 +1158,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
2107 199.47.87.140
2540 45.5.186.2
-- Of course there is CIAT's 45.5.186.2, but also 45.5.184.2 appears to be CIAT… I wonder why they have two harvesters?
+- Of course there is CIAT’s 45.5.186.2, but also 45.5.184.2 appears to be CIAT… I wonder why they have two harvesters?
199.47.87.140 and 199.47.87.141 are TurnItIn with the following user agent:
TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
@@ -1181,7 +1181,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
45.5.186.2 is CIAT as usual…
70.32.83.92 and 205.186.128.185 are CCAFS as usual…
66.249.66.219 is Google…
-- I'm thinking it might finally be time to increase the threshold of the Linode CPU alerts
+- I’m thinking it might finally be time to increase the threshold of the Linode CPU alerts
- I adjusted the alert threshold from 250% to 275%
@@ -1233,7 +1233,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
9239 45.5.186.2
45.5.186.2 and 45.5.184.2 are CIAT as always
-85.25.237.71 is some new server in Germany that I've never seen before with the user agent:
+85.25.237.71 is some new server in Germany that I’ve never seen before with the user agent:
Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html
index 4610a19b1..ab3e3c04a 100644
--- a/docs/2019-02/index.html
+++ b/docs/2019-02/index.html
@@ -69,7 +69,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
@@ -99,7 +99,7 @@ sys 0m1.979s
@@ -146,7 +146,7 @@ sys 0m1.979s
February, 2019
@@ -179,7 +179,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
-- Normally I'd say this was very high, but about this time last year I remember thinking the same thing when we had 3.1 million…
+- Normally I’d say this was very high, but about this time last year I remember thinking the same thing when we had 3.1 million…
- I will have to keep an eye on this to see if there is some error in Solr…
- Atmire sent their pull request to re-enable the Metadata Quality Module (MQM) on our 5_x-dev branch today
@@ -292,7 +292,7 @@ COPY 321
4658 205.186.128.185
4658 70.32.83.92
-- At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there's nothing we can do to improve REST API performance!
+- At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there’s nothing we can do to improve REST API performance!
- Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?
2019-02-05
@@ -461,7 +461,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
848 66.249.66.219
- So it seems that the load issue comes from the REST API, not the XMLUI
-- I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don't get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)
+- I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don’t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)
- Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:
Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
@@ -470,7 +470,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE

-- IITA editors or approvers should be added to that step (though I'm curious why nobody is in that group currently)
+- IITA editors or approvers should be added to that step (though I’m curious why nobody is in that group currently)
- Abenet says we are not using the “Accept/Reject” step so this group should be deleted
- Bizuwork asked about the “DSpace Submission Approved and Archived” emails that stopped working last month
- I tried the test-email command on DSpace and it indeed is not working:
@@ -489,7 +489,7 @@ Error sending email:
Please see the DSpace documentation for assistance.
-- I can't connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what's up
+- I can’t connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what’s up
- CGNET said these servers were discontinued in 2018-01 and that I should use Office 365
2019-02-08
@@ -577,18 +577,18 @@ Please see the DSpace documentation for assistance.
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
95
-- It's very clear to me now that the API requests are the heaviest!
-- I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it's becoming a bit of the boy who cried wolf because it alerts like clockwork twice per day!
+- It’s very clear to me now that the API requests are the heaviest!
+- I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it’s becoming a bit of the boy who cried wolf because it alerts like clockwork twice per day!
- Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository (#408) so I can track changes and distribute them more formally instead of just keeping them collected on the wiki
- Started adding IITA research theme (cg.identifier.iitatheme) to CGSpace
-- I'm still waiting for feedback from IITA whether they actually want to use “SOCIAL SCIENCE & AGRIC BUSINESS” because it is listed as “Social Science and Agribusiness” on their website
+- I’m still waiting for feedback from IITA whether they actually want to use “SOCIAL SCIENCE & AGRIC BUSINESS” because it is listed as “Social Science and Agribusiness” on their website
- Also, I think they want to do some mappings of items with existing subjects to these new themes
- Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) (#409)
-- I'm still waiting to hear from Bizuwork whether we'll batch update all existing items with the old name style
+- I’m still waiting to hear from Bizuwork whether we’ll batch update all existing items with the old name style
- No, there is only one entry and Bizu already fixed it
@@ -606,7 +606,7 @@ Please see the DSpace documentation for assistance.
Error sending email:
- Error: cannot test email because mail.server.disabled is set to true
-- I'm not sure why I didn't know about this configuration option before, and always maintained multiple configurations for development and production
+- I’m not sure why I didn’t know about this configuration option before, and always maintained multiple configurations for development and production
- I will modify the Ansible DSpace role to use this in its build.properties template
@@ -645,11 +645,11 @@ Please see the DSpace documentation for assistance.
dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
-- I'd have to think about what to make the REST API endpoints, perhaps: /statistics/top/items?limit=10
+- I’d have to think about what to make the REST API endpoints, perhaps: /statistics/top/items?limit=10
- But how do I do top items by views / downloads separately?
- I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly
-- The quality is JPEG 75 and I don't see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:
+- The quality is JPEG 75 and I don’t see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:
@@ -661,7 +661,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads
2019-02-13
-- ILRI ICT reset the password for the CGSpace mail account, but I still can't get it to send mail from DSpace's test-email utility
+- ILRI ICT reset the password for the CGSpace mail account, but I still can’t get it to send mail from DSpace’s test-email utility
- I even added extra mail properties to dspace.cfg as suggested by someone on the dspace-tech mailing list:
mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
@@ -671,8 +671,8 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads
Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
-- I tried to log into the Outlook 365 web mail and it doesn't work so I've emailed ILRI ICT again
-- After reading the common mistakes in the JavaMail FAQ I reconfigured the extra properties in DSpace's mail configuration to be simply:
+- I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again
+- After reading the common mistakes in the JavaMail FAQ I reconfigured the extra properties in DSpace’s mail configuration to be simply:
mail.extraproperties = mail.smtp.starttls.enable=true
@@ -707,7 +707,7 @@ $ sudo sysctl kernel.unprivileged_userns_clone=1
$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
-- Which totally works, but Podman's rootless support doesn't work with port mappings yet…
+- Which totally works, but Podman’s rootless support doesn’t work with port mappings yet…
- Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:
# systemctl stop tomcat7
@@ -731,14 +731,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
# find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
- After restarting Tomcat the usage statistics are back
-- Interestingly, many of the locks were from last month, last year, and even 2015! I'm pretty sure that's not supposed to be how locks work…
+- Interestingly, many of the locks were from last month, last year, and even 2015! I’m pretty sure that’s not supposed to be how locks work…
- Help Sarah Kasyoka finish an item submission that she was having issues with due to the file size
-- I increased the nginx upload limit, but she said she was having problems and couldn't really tell me why
+- I increased the nginx upload limit, but she said she was having problems and couldn’t really tell me why
- I logged in as her and completed the submission with no problems…
2019-02-15
-- Tomcat was killed around 3AM by the kernel's OOM killer according to dmesg:
+- Tomcat was killed around 3AM by the kernel’s OOM killer according to dmesg:
[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
[Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
@@ -748,7 +748,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
-- I suspect it was related to the media-filter cron job that runs at 3AM but I don't see anything particular in the log files
+- I suspect it was related to the media-filter cron job that runs at 3AM but I don’t see anything particular in the log files
- I want to try to normalize the text_lang values to make working with metadata easier
- We currently have a bunch of weird values that DSpace uses like NULL, en_US, and en and others that have been entered manually by editors:
@@ -769,19 +769,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
- The majority are NULL, en_US, the blank string, and en—the rest are not enough to be significant
- Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!
-- I'm going to normalize these to NULL at least on DSpace Test for now:
+- I’m going to normalize these to NULL at least on DSpace Test for now:
dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
UPDATE 1045410
-- I started proofing IITA's 2019-01 records that Sisay uploaded this week
+- I started proofing IITA’s 2019-01 records that Sisay uploaded this week
-- There were 259 records in IITA's original spreadsheet, but there are 276 in Sisay's collection
+- There were 259 records in IITA’s original spreadsheet, but there are 276 in Sisay’s collection
- Also, I found that there are at least twenty duplicates in these records that we will need to address
- ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works
-- Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman's volumes:
+- Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman’s volumes:
$ podman pull postgres:9.6-alpine
$ podman volume create dspacedb_data
@@ -793,7 +793,7 @@ $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h loca
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
-- And it's all running without root!
+- And it’s all running without root!
- Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:
$ podman volume create artifactory_data
@@ -808,7 +808,7 @@ $ podman start artifactory
2019-02-17
-- I ran DSpace's cleanup task on CGSpace (linode18) and there were errors:
+- I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
$ dspace cleanup -v
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
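- One workaround for this kind of constraint violation is to null out the offending primary bitstream reference before re-running the cleanup (a sketch; OFFENDING_BITSTREAM_ID is a placeholder for whatever ID the error actually reports):

```
dspace=# UPDATE bundle SET primary_bitstream_id = NULL WHERE primary_bitstream_id IN (OFFENDING_BITSTREAM_ID);
```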
@@ -946,7 +946,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
2019-02-19
- Linode sent another alert about CPU usage on CGSpace (linode18) averaging 417% this morning
-- Unfortunately, I don't see any strange activity in the web server API or XMLUI logs at that time in particular
+- Unfortunately, I don’t see any strange activity in the web server API or XMLUI logs at that time in particular
- So far today the top ten IPs in the XMLUI logs are:
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
@@ -962,9 +962,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
14686 143.233.242.130
- 143.233.242.130 is in Greece and using the user agent “Indy Library”, like the top IP yesterday (94.71.244.172)
-- That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don't know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this
-- The user is requesting only things like /handle/10568/56199?show=full so it's nothing malicious, only annoying
-- Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday's nginx rate limiting updates
+- That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don’t know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this
+- The user is requesting only things like /handle/10568/56199?show=full so it’s nothing malicious, only annoying
+- Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday’s nginx rate limiting updates
- I should really try to script something around ipapi.co to get these quickly and easily
@@ -984,7 +984,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
12360 2a01:7e00::f03c:91ff:fe0a:d645
2a01:7e00::f03c:91ff:fe0a:d645 is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester…
-- Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I'm so fucking sick of this
+- Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I’m so fucking sick of this
- Our usage stats have exploded the last few months:

@@ -1027,12 +1027,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
-- I wrote a quick and dirty Python script called resolve-addresses.py to resolve IP addresses to their owning organization's name, ASN, and country using the IPAPI.co API
+- I wrote a quick and dirty Python script called resolve-addresses.py to resolve IP addresses to their owning organization’s name, ASN, and country using the IPAPI.co API
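- The core of such a lookup is just one HTTP call per address (a sketch against the ipapi.co JSON endpoint; the exact field names here are assumptions):

```
$ curl -s 'https://ipapi.co/35.237.175.180/json/' | jq -r '[.org, .asn, .country_name] | @tsv'
```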
2019-02-20
- Ben Hack was asking about getting authors publications programmatically from CGSpace for the new ILRI website
-- I told him that they should probably try to use the REST API's find-by-metadata-field endpoint
+- I told him that they should probably try to use the REST API’s find-by-metadata-field endpoint
- The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
@@ -1041,7 +1041,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
- This returns six items for me, which is the same I see in a Discovery search
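- Since the response is a JSON array of items, piping it through jq makes it easy to pull out just the handles (a sketch; assumes the item objects carry a handle field as they do elsewhere in the REST API):

```
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}' | jq -r '.[].handle'
```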
- Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my dspace-statistics-api
-- I was playing with YasGUI to query AGROVOC's SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually
+- I was playing with YasGUI to query AGROVOC’s SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually
- I think I want to stick to the regular web services to validate AGROVOC terms

@@ -1064,7 +1064,7 @@ $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subje
$ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
-- And then a list of all the unique unmatched terms using some utility I've never heard of before called comm or with diff:
+- And then a list of all the unique unmatched terms using some utility I’ve never heard of before called comm or with diff:
$ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
@@ -1077,7 +1077,7 @@ COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
-- I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it's almost ready so I created a pull request (#413)
+- I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it’s almost ready so I created a pull request (#413)
- I still need to test the batch tagging of IITA items with themes based on their IITA subjects:
- NATURAL RESOURCE MANAGEMENT research theme to items with NATURAL RESOURCE MANAGEMENT subject
@@ -1095,13 +1095,13 @@ COPY 33
Help Udana from WLE with some issues related to CGSpace items on their Publications website
- He wanted some IWMI items to show up in their publications website
-- The items were mapped into WLE collections, but still weren't showing up on the publications website
+- The items were mapped into WLE collections, but still weren’t showing up on the publications website
- I told him that he needs to add the cg.identifier.wletheme to the items so that the website indexer finds them
- A few days ago he added the metadata to 10568/93011 and now I see that the item is present on the WLE publications website
-Start looking at IITA's latest round of batch uploads called “IITA_Feb_14” on DSpace Test
+Start looking at IITA’s latest round of batch uploads called “IITA_Feb_14” on DSpace Test
- One misspelled authorship type
- A few dozen incorrect inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)
@@ -1110,7 +1110,7 @@ COPY 33
- Some whitespace and consistency issues in sponsorships
- Eight items with invalid ISBN: 0-471-98560-3
- Two incorrectly formatted ISSNs
-- Lots of incorrect values in subjects, but that's a difficult problem to fix in an automated way
+- Lots of incorrect values in subjects, but that’s a difficult problem to fix in an automated way
@@ -1137,8 +1137,8 @@ return "unmatched"
2019-02-24
-- I decided to try to validate the AGROVOC subjects in IITA's recent batch upload by dumping all their terms, checking them in en/es/fr with agrovoc-lookup.py, then reconciling against the final list using reconcile-csv with OpenRefine
-- I'm not sure how to deal with terms like “CORN” that are alternative labels (altLabel) in AGROVOC where the preferred label (prefLabel) would be “MAIZE”
+- I decided to try to validate the AGROVOC subjects in IITA’s recent batch upload by dumping all their terms, checking them in en/es/fr with agrovoc-lookup.py, then reconciling against the final list using reconcile-csv with OpenRefine
+- I’m not sure how to deal with terms like “CORN” that are alternative labels (altLabel) in AGROVOC where the preferred label (prefLabel) would be “MAIZE”
- For example, a query for CORN* returns:
"results": [
@@ -1160,7 +1160,7 @@ return "unmatched"
I did a duplicate check of the IITA Feb 14 records on DSpace Test and there were about fifteen or twenty items reported
- A few of them are actually in previous IITA batch updates, which means they haven't been uploaded to CGSpace yet, so I worry that there would be many more
-- I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I'm not sure I can because the Earlham guys are still testing COPO actively on DSpace Test
+- I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I’m not sure I can because the Earlham guys are still testing COPO actively on DSpace Test
@@ -1185,7 +1185,7 @@ return "unmatched"
/home/cgspace.cgiar.org/log/solr.log.2019-02-23.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-24:34
-- But I don't see anything interesting in yesterday's Solr log…
+- But I don’t see anything interesting in yesterday’s Solr log…
- I see this in the Tomcat 7 logs yesterday:
Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
@@ -1209,7 +1209,7 @@ Feb 25 21:37:49 linode18 tomcat7[28363]: at java.lang.Throwable.readObje
Feb 25 21:37:49 linode18 tomcat7[28363]: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Feb 25 21:37:49 linode18 tomcat7[28363]: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
-- I don't think that's related…
+- I don’t think that’s related…
- Also, now the Solr admin UI says “statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”
- In the Solr log I see:
@@ -1245,12 +1245,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
On a hunch I tried adding ulimit -v unlimited to the Tomcat catalina.sh and now Solr starts up with no core errors and I actually have statistics for January and February on some communities, but not others
I wonder if the address space limits that I added via LimitAS=infinity in the systemd service are somehow not working?
I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the LimitAS setting does work, and the infinity setting in systemd does get translated to “unlimited” on the service
-I thought it might be open file limit, but it seems we're nowhere near the current limit of 16384:
+I thought it might be open file limit, but it seems we’re nowhere near the current limit of 16384:
# lsof -u dspace | wc -l
3016
-- For what it's worth I see the same errors about solr_update_time_stamp on DSpace Test (linode19)
+- For what it’s worth I see the same errors about solr_update_time_stamp on DSpace Test (linode19)
- Update DSpace Test to Tomcat 7.0.93
- Something seems to have happened (some Atmire scheduled task, perhaps the CUA one at 7AM?) on CGSpace because I checked a few communities and collections on CGSpace and there are now statistics for January and February
@@ -1267,27 +1267,27 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
- According to the REST API collection 1021 appears to be CCAFS Tools, Maps, Datasets and Models
- I looked at the WORKFLOW_STEP_1 (Accept/Reject) and the group is of course empty
-- As we've seen several times recently, we are not using this step so it should simply be deleted
+- As we’ve seen several times recently, we are not using this step so it should simply be deleted
2019-02-27
- Discuss batch uploads with Sisay
-- He's trying to upload some CTA records, but it's not possible to do collection mapping when using the web UI
+- He’s trying to upload some CTA records, but it’s not possible to do collection mapping when using the web UI
- I sent a mail to the dspace-tech mailing list to ask about the inability to perform mappings when uploading via the XMLUI batch upload
-- He asked me to upload the files for him via the command line, but the file he referenced (Thumbnails_feb_2019.zip) doesn't exist
-- I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file's name:
+- He asked me to upload the files for him via the command line, but the file he referenced (Thumbnails_feb_2019.zip) doesn’t exist
+- I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file’s name:
$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
-- Why don't they just derive the directory from the path to the zip file?
-- Working on Udana's Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
+- Why don’t they just derive the directory from the path to the zip file?
+- Working on Udana’s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
- I also added a few regions because they are obvious for the countries
- Also I added some rights fields that I noticed were easily available from the publications pages
-- I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn't find any
+- I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn’t find any
- I uploaded fifty-two records to the Restoring Degraded Landscapes collection on CGSpace
@@ -1299,7 +1299,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
- Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out sigh
-- Now I'm getting this message when trying to use DSpace's test-email script:
+- Now I’m getting this message when trying to use DSpace’s test-email script:
$ dspace test-email
@@ -1313,8 +1313,8 @@ Error sending email:
Please see the DSpace documentation for assistance.
-- I've tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working
-- I sent a mail to ILRI ICT to check if we're locked out or reset the password again
+- I’ve tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working
+- I sent a mail to ILRI ICT to check if we’re locked out or reset the password again
diff --git a/docs/2019-03/index.html b/docs/2019-03/index.html
index 3dde03db9..41e2899cd 100644
--- a/docs/2019-03/index.html
+++ b/docs/2019-03/index.html
@@ -8,9 +8,9 @@
@@ -73,7 +73,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
@@ -120,16 +120,16 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
March, 2019
2019-03-01
-- I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
+- I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
- I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
-- Looking at the other half of Udana's WLE records from 2018-11
+- Looking at the other half of Udana’s WLE records from 2018-11
- I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
- I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
@@ -142,14 +142,14 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
2019-03-03
-- Trying to finally upload IITA's 259 Feb 14 items to CGSpace so I exported them from DSpace Test:
+- Trying to finally upload IITA’s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:
$ mkdir 2019-03-03-IITA-Feb14
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
- As I was inspecting the archive I noticed that there were some problems with the bitstreams:
-- First, Sisay didn't include the bitstream descriptions
+- First, Sisay didn’t include the bitstream descriptions
- Second, only five items had bitstreams and I remember in the discussion with IITA that there should have been nine!
- I had to refer to the original CSV from January to find the file names, then download and add them to the export contents manually!
@@ -158,11 +158,11 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
-- DSpace's export function doesn't include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something
+- DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something
- After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the dspace cleanup script
- Merge the IITA research theme changes from last month to the 5_x-prod branch (#413)
-- I will deploy to CGSpace soon and then think about how to batch tag all IITA's existing items with this metadata
+- I will deploy to CGSpace soon and then think about how to batch tag all IITA’s existing items with this metadata
- Deploy Tomcat 7.0.93 on CGSpace (linode18) after having tested it on DSpace Test (linode19) for a week
@@ -170,7 +170,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
2019-03-06
- Abenet was having problems with a CIP user account, I think that the user could not register
-- I suspect it's related to the email issue that ICT hasn't responded about since last week
+- I suspect it’s related to the email issue that ICT hasn’t responded about since last week
- As I thought, I still cannot send emails from CGSpace:
$ dspace test-email
@@ -203,17 +203,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
2019-03-08
-- There's an issue with CGSpace right now where all items are giving a blank page in the XMLUI
+- There’s an issue with CGSpace right now where all items are giving a blank page in the XMLUI
Interestingly, if I check an item in the REST API it is also mostly blank: only the title and the ID! On second thought I realize I probably was just seeing the default view without any “expands”
-- I don't see anything unusual in the Tomcat logs, though there are thousands of those solr_update_time_stamp errors:
+- I don’t see anything unusual in the Tomcat logs, though there are thousands of those solr_update_time_stamp errors:
# journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
1076
-- I restarted Tomcat and it's OK now…
+- I restarted Tomcat and it’s OK now…
- Skype meeting with Peter and Abenet and Sisay
- We want to try to crowd source the correction of invalid AGROVOC terms starting with the ~313 invalid ones from our top 1500
@@ -244,7 +244,7 @@ UPDATE 44
2019-03-10
-- Working on tagging IITA's items with their new research theme (cg.identifier.iitatheme) based on their existing IITA subjects (see notes from 2019-02)
+- Working on tagging IITA’s items with their new research theme (cg.identifier.iitatheme) based on their existing IITA subjects (see notes from 2019-02)
- I exported the entire IITA community from CGSpace and then used csvcut to extract only the needed fields:
$ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
@@ -258,7 +258,7 @@ UPDATE 44
if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
-- Then it's more annoying because there are four IITA subject columns…
+- Then it’s more annoying because there are four IITA subject columns…
- In total this would add research themes to 1,755 items
- I want to double check one last time with Bosede that they would like to do this, because I also see that this will tag a few hundred items from the 1970s and 1980s
@@ -268,7 +268,7 @@ UPDATE 44
2019-03-12
-- I imported the changes to 256 of IITA's records on CGSpace
+- I imported the changes to 256 of IITA’s records on CGSpace
2019-03-14
@@ -291,21 +291,21 @@ UPDATE 44
done
-- Then I couldn't figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:
+- Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:
$ grep -oE '201[89]' /tmp/*.csv | sort -u
/tmp/94834.csv:2018
/tmp/95615.csv:2018
/tmp/96747.csv:2018
-- And looking at those items more closely, only one of them has an issue date of after 2018-04, so I will only update that one (as the countrie's name only changed in 2018-04)
+- And looking at those items more closely, only one of them has an issue date of after 2018-04, so I will only update that one (as the country’s name only changed in 2018-04)
- Run all system updates and reboot linode20
-- Follow up with Felix from Earlham to see if he's done testing DSpace Test with COPO so I can re-sync the server from CGSpace
+- Follow up with Felix from Earlham to see if he’s done testing DSpace Test with COPO so I can re-sync the server from CGSpace
2019-03-15
- CGSpace (linode18) has the blank page error again
-- I'm not sure if it's related, but I see the following error in DSpace's log:
+- I’m not sure if it’s related, but I see the following error in DSpace’s log:
2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
@@ -354,7 +354,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
10 dspaceCli
15 dspaceWeb
-- I didn't see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today might be related?
+- I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today might be related?
SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
java.util.EmptyStackException
@@ -408,7 +408,7 @@ java.util.EmptyStackException
Last week Felix from Earlham said that they finished testing on DSpace Test (linode19) so I made backups of some things there and re-deployed the system on Ubuntu 18.04
- During re-deployment I hit a few issues with the Ansible playbooks and made some minor improvements
-- There seems to be an issue with nodejs's dependencies now, which causes npm to get uninstalled when installing the certbot dependencies (due to a conflict in libssl dependencies)
+- There seems to be an issue with nodejs’s dependencies now, which causes npm to get uninstalled when installing the certbot dependencies (due to a conflict in libssl dependencies)
- I re-worked the playbooks to use Node.js from the upstream official repository for now
@@ -421,13 +421,13 @@ java.util.EmptyStackException
- After restarting Tomcat, Solr was giving the “Error opening new searcher” error for all cores
- I stopped Tomcat, added ulimit -v unlimited to the catalina.sh script and deleted all old locks in the DSpace solr directory and then DSpace started up normally
-- I'm still not exactly sure why I see this error and if the ulimit trick actually helps, as the tomcat7.service has LimitAS=infinity anyways (and from checking the PID's limits file in /proc it seems to be applied)
+- I’m still not exactly sure why I see this error and if the ulimit trick actually helps, as the tomcat7.service has LimitAS=infinity anyways (and from checking the PID’s limits file in /proc it seems to be applied)
- Then I noticed that the item displays were blank… so I checked the database info and saw there were some unfinished migrations
-- I'm not entirely sure if it's related, but I tried to delete the old migrations and then force running the ignored ones like when we upgraded to DSpace 5.8 in 2018-06 and then after restarting Tomcat I could see the item displays again
+- I’m not entirely sure if it’s related, but I tried to delete the old migrations and then force running the ignored ones like when we upgraded to DSpace 5.8 in 2018-06 and then after restarting Tomcat I could see the item displays again
I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api
-I ran DSpace's cleanup task on CGSpace (linode18) and there were errors:
+I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
$ dspace cleanup -v
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
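The usual workaround for this cleanup error is to null out the bundle's primary bitstream pointer and then re-run the cleanup; a sketch with psycopg2, where the bitstream ID and connection details are placeholders (the real ID comes from the full error message):

```
import psycopg2

BITSTREAM_ID = 123456  # placeholder; use the ID from the PostgreSQL error message

conn = psycopg2.connect(dbname="dspace", user="dspace", password="fuuu", host="localhost")
with conn, conn.cursor() as cur:
    # clear the reference that blocks `dspace cleanup` from deleting the bitstream
    cur.execute(
        "UPDATE bundle SET primary_bitstream_id = NULL WHERE primary_bitstream_id = %s",
        (BITSTREAM_ID,),
    )
conn.close()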
@@ -485,8 +485,8 @@ $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F
72 dspace.log.2019-03-17
8 dspace.log.2019-03-18
-- It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don't match
-- Anyways, the full error in DSpace's log is:
+- It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don’t match
+- Anyways, the full error in DSpace’s log is:
2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
@@ -509,7 +509,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column "waiting" does not exist at character 217
- This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6
-- I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it's a Cocoon thing?
+- I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?
- Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:
$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
@@ -567,7 +567,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
717 2019-03-08 11:
59 2019-03-08 12:
-- I'm not sure if it's cocoon or that's just a symptom of something else
+- I’m not sure if it’s cocoon or that’s just a symptom of something else
2019-03-19
@@ -581,8 +581,8 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
(1 row)
- Perhaps my agrovoc-lookup.py script could notify if it finds these because they potentially give false negatives
-- CGSpace (linode18) is having problems with Solr again, I'm seeing “Error opening new searcher” in the Solr logs and there are no stats for previous years
-- Apparently the Solr statistics shards didn't load properly when we restarted Tomcat yesterday:
+- CGSpace (linode18) is having problems with Solr again, I’m seeing “Error opening new searcher” in the Solr logs and there are no stats for previous years
+- Apparently the Solr statistics shards didn’t load properly when we restarted Tomcat yesterday:
2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
...
@@ -593,7 +593,7 @@ Caused by: org.apache.solr.common.SolrException: Error opening new searcher
... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
-- For reference, I don't see the ulimit -v unlimited in the catalina.sh script, though the tomcat7 systemd service has LimitAS=infinity
+- For reference, I don’t see the ulimit -v unlimited in the catalina.sh script, though the tomcat7 systemd service has LimitAS=infinity
- The limits of the current Tomcat java process are:
# cat /proc/27182/limits
@@ -615,7 +615,7 @@ Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
-- I will try to add ulimit -v unlimited to the Catalina startup script and check the output of the limits to see if it's different in practice, as some wisdom on Stack Overflow says this solves the Solr core issues and I've superstitiously tried it various times in the past
+- I will try to add ulimit -v unlimited to the Catalina startup script and check the output of the limits to see if it’s different in practice, as some wisdom on Stack Overflow says this solves the Solr core issues and I’ve superstitiously tried it various times in the past
- The result is the same before and after, so adding the ulimit directly is unnecessary (whether unlimited address space is actually useful is another question)
@@ -627,7 +627,7 @@ Max realtime timeout unlimited unlimited us
# systemctl start tomcat7
- After restarting I confirmed that all Solr statistics cores were loaded successfully…
-- Another avenue might be to look at point releases in Solr 4.10.x, as we're running 4.10.2 and they released 4.10.3 and 4.10.4 back in 2014 or 2015
+- Another avenue might be to look at point releases in Solr 4.10.x, as we’re running 4.10.2 and they released 4.10.3 and 4.10.4 back in 2014 or 2015
- I see several issues regarding locks and IndexWriter that were fixed in Solr and Lucene 4.10.3 and 4.10.4…
@@ -651,7 +651,7 @@ Max realtime timeout unlimited unlimited us
2019-03-21
-- It's been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:
+- It’s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:
$ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
3 2019-03-20 00:
@@ -732,7 +732,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
440 2019-03-23 08:
260 2019-03-23 09:
-- I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn't
+- I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn’t
- Trying to drill down more, I see that the bulk of the errors started around 21:20:
$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
@@ -794,16 +794,16 @@ org.postgresql.util.PSQLException: This statement has been closed.
I restarted Tomcat and now the item displays are working again for now
-I am wondering if this is an issue with removing abandoned connections in Tomcat's JDBC pooling?
+I am wondering if this is an issue with removing abandoned connections in Tomcat’s JDBC pooling?
-- It's hard to tell because we have logAbanded enabled, but I don't see anything in the tomcat7 service logs in the systemd journal
+- It’s hard to tell because we have logAbandoned enabled, but I don’t see anything in the tomcat7 service logs in the systemd journal
I sent another mail to the dspace-tech mailing list with my observations
-I spent some time trying to test and debug the Tomcat connection pool's settings, but for some reason our logs are either messed up or no connections are actually getting abandoned
+I spent some time trying to test and debug the Tomcat connection pool’s settings, but for some reason our logs are either messed up or no connections are actually getting abandoned
I compiled this TomcatJdbcConnectionTest and created a bunch of database connections and waited a few minutes but they never got abandoned until I created over maxActive (75), after which almost all were purged at once
@@ -820,7 +820,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
- I need to remember to check the active connections next time we have issues with blank item pages on CGSpace
-- In other news, I've been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing
+- In other news, I’ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing
- I deployed the latest 5_x-prod branch on CGSpace (linode18) and added more validation to the JDBC pool in our Tomcat config
- This includes the new testWhileIdle and testOnConnect pool settings as well as the two new JDBC interceptors: StatementFinalizer and ConnectionState that should hopefully make sure our connections in the pool are valid
@@ -828,7 +828,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
- I spent one hour looking at the invalid AGROVOC terms from last week
-- It doesn't seem like any of the editors did any work on this so I did most of them
+- It doesn’t seem like any of the editors did any work on this so I did most of them
@@ -842,21 +842,21 @@ org.postgresql.util.PSQLException: This statement has been closed.
Looking at the DBCP status on CGSpace via jconsole and everything looks good, though I wonder why timeBetweenEvictionRunsMillis is -1, because the Tomcat 7.0 JDBC docs say the default is 5000…
- Could be an error in the docs, as I see the Apache Commons DBCP has -1 as the default
-- Maybe I need to re-evaluate the “defauts” of Tomcat 7's DBCP and set them explicitly in our config
+- Maybe I need to re-evaluate the “defaults” of Tomcat 7’s DBCP and set them explicitly in our config
- From Tomcat 8 they seem to default to Apache Commons’ DBCP 2.x
-Also, CGSpace doesn't have many Cocoon errors yet this morning:
+Also, CGSpace doesn’t have many Cocoon errors yet this morning:
$ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
4 2019-03-25 00:
1 2019-03-25 01:
-- Holy shit I just realized we've been using the wrong DBCP pool in Tomcat
+- Holy shit I just realized we’ve been using the wrong DBCP pool in Tomcat
- By default you get the Commons DBCP one unless you specify factory org.apache.tomcat.jdbc.pool.DataSourceFactory
-- Now I see all my interceptor settings etc in jconsole, where I didn't see them before (also a new tomcat.jdbc mbean)!
-- No wonder our settings didn't quite match the ones in the Tomcat DBCP Pool docs
+- Now I see all my interceptor settings etc in jconsole, where I didn’t see them before (also a new tomcat.jdbc mbean)!
+- No wonder our settings didn’t quite match the ones in the Tomcat DBCP Pool docs
- Uptime Robot reported that CGSpace went down and I see the load is very high
@@ -885,7 +885,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
1222 35.174.184.209
1720 2a01:4f8:13b:1296::2
-- The IPs look pretty normal except we've never seen 93.179.69.74 before, and it uses the following user agent:
+- The IPs look pretty normal except we’ve never seen 93.179.69.74 before, and it uses the following user agent:
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
@@ -894,7 +894,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
1
-- That's weird because the total number of sessions today seems low compared to recent days:
+- That’s weird because the total number of sessions today seems low compared to recent days:
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
5657
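The same unique-session count can be done in Python when grep's binary matching gets in the way again; a sketch, assuming the usual dspace.log location:

```
import re

SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32})")

def unique_sessions(path):
    sessions = set()
    # errors="replace" avoids choking on the occasional binary junk in the logs
    with open(path, errors="replace") as log:
        for line in log:
            match = SESSION_RE.search(line)
            if match:
                sessions.add(match.group(1))
    return len(sessions)

# adjust to the real log path on the server
print(unique_sessions("/home/cgspace.cgiar.org/log/dspace.log.2019-03-25"))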
@@ -914,7 +914,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
- I restarted Tomcat and deployed the new Tomcat JDBC settings on CGSpace since I had to restart the server anyways
-- I need to watch this carefully though because I've read some places that Tomcat's DBCP doesn't track statements and might create memory leaks if an application doesn't close statements before a connection gets returned back to the pool
+- I need to watch this carefully though because I’ve read some places that Tomcat’s DBCP doesn’t track statements and might create memory leaks if an application doesn’t close statements before a connection gets returned back to the pool
- According the Uptime Robot the server was up and down a few more times over the next hour so I restarted Tomcat again
@@ -969,14 +969,14 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
216.244.66.198 is DotBot
93.179.69.74 is some IP in Ukraine, which I will add to the list of bot IPs in nginx
- I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)
-- Looking at the database usage I'm wondering why there are so many connections from the DSpace CLI:
+- Looking at the database usage I’m wondering why there are so many connections from the DSpace CLI:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
10 dspaceCli
13 dspaceWeb
-- Looking closer I see they are all idle… so at least I know the load isn't coming from some background nightly task or something
+- Looking closer I see they are all idle… so at least I know the load isn’t coming from some background nightly task or something
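A scriptable version of that pg_stat_activity check, including the connection state so the idle ones stand out; the connection details are placeholders:

```
import psycopg2

conn = psycopg2.connect(dbname="dspace", user="dspace", password="fuuu", host="localhost")
with conn.cursor() as cur:
    # group connections the same way as the psql | grep | sort | uniq -c one-liner
    cur.execute(
        "SELECT application_name, state, COUNT(*) FROM pg_stat_activity "
        "GROUP BY application_name, state ORDER BY COUNT(*) DESC"
    )
    for app, state, count in cur.fetchall():
        print(f"{count:5d}  {app or '(none)'}  {state or '(unknown)'}")
conn.close()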
- Make a minor edit to my agrovoc-lookup.py script to match subject terms with parentheses like COCOA (PLANT)
- Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week
@@ -984,12 +984,12 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
- UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
-- Looking at the nginx logs I don't see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
+- Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
# grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
2931
-- So I'm adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
+- So I’m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
- Otherwise, these are the top users in the web and API logs the last hour (18–19):
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
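The zcat/awk pipeline works fine, but here is a Python equivalent that handles the rotated and gzipped access logs in one pass; the glob pattern and hour strings mirror the command above:

```
import glob
import gzip
from collections import Counter

HOURS = ("26/Mar/2019:18", "26/Mar/2019:19")
counts = Counter()

for path in glob.glob("/var/log/nginx/*access.log*"):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as log:
        for line in log:
            if any(hour in line for hour in HOURS):
                # the client IP is the first field of the combined log format
                counts[line.split()[0]] += 1

for ip, count in counts.most_common(10):
    print(f"{count:6d} {ip}")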
@@ -1021,7 +1021,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
937
-- I will add their IPs to the list of bot IPs in nginx so I can tag them as bots to let Tomcat's Crawler Session Manager Valve to force them to re-use their session
+- I will add their IPs to the list of bot IPs in nginx so I can tag them as bots to let Tomcat’s Crawler Session Manager Valve to force them to re-use their session
- Another user agent behaving badly in Colombia is “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
- I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request
- I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages
@@ -1029,7 +1029,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l
119
-- What's strange is that I can't see any of their requests in the DSpace log…
+- What’s strange is that I can’t see any of their requests in the DSpace log…
$ grep -I -c 45.5.184.72 dspace.log.2019-03-26
0
@@ -1050,7 +1050,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
- None of these 18.x.x.x IPs specify a user agent and they are all on Amazon!
- Shortly after I started the re-indexing UptimeRobot began to complain that CGSpace was down, then up, then down, then up…
-- I see the load on the server is about 10.0 again for some reason though I don't know WHAT is causing that load
+- I see the load on the server is about 10.0 again for some reason though I don’t know WHAT is causing that load
- It could be the CPU steal metric, as if Linode has oversold the CPU resources on this VM host…
@@ -1061,14 +1061,14 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds


-- What's clear from this is that some other VM on our host has heavy usage for about four hours at 6AM and 6PM and that during that time the load on our server spikes
+- What’s clear from this is that some other VM on our host has heavy usage for about four hours at 6AM and 6PM and that during that time the load on our server spikes
- CPU steal has drastically increased since March 25th
- It might be time to move to a dedicated CPU VM instances, or even real servers
-- For now I just sent a support ticket to bring this to Linode's attention
+- For now I just sent a support ticket to bring this to Linode’s attention
-- In other news, I see that it's not even the end of the month yet and we have 3.6 million hits already:
+- In other news, I see that it’s not even the end of the month yet and we have 3.6 million hits already:
# zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
3654911
@@ -1120,7 +1120,7 @@ sys 0m2.551s
- It has 64GB of ECC RAM, six core Xeon processor from 2018, and 2x960GB NVMe storage
- The alternative of staying with Linode and using dedicated CPU instances with added block storage gets expensive quickly if we want to keep more than 16GB of RAM (do we?)
-- Regarding RAM, our JVM heap is 8GB and we leave the rest of the system's 32GB of RAM to PostgreSQL and Solr buffers
+- Regarding RAM, our JVM heap is 8GB and we leave the rest of the system’s 32GB of RAM to PostgreSQL and Solr buffers
- Seeing as we have 56GB of Solr data it might be better to have more RAM in order to keep more of it in memory
- Also, I know that the Linode block storage is a major bottleneck for Solr indexing
@@ -1128,7 +1128,7 @@ sys 0m2.551s
Looking at the weird issue with shitloads of downloads on the CTA item again
-The item was added on 2019-03-13 and these three IPs have attempted to download the item's bitstream 43,000 times since it was added eighteen days ago:
+The item was added on 2019-03-13 and these three IPs have attempted to download the item’s bitstream 43,000 times since it was added eighteen days ago:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
42 196.43.180.134
@@ -1147,7 +1147,7 @@ sys 0m2.551s
2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
-- IWMI people emailed to ask why two items with the same DOI don't have the same Altmetric score:
+- IWMI people emailed to ask why two items with the same DOI don’t have the same Altmetric score:
- https://cgspace.cgiar.org/handle/10568/89846 (Bioversity)
- https://cgspace.cgiar.org/handle/10568/89975 (CIAT)
@@ -1178,7 +1178,7 @@ sys 0m2.551s
- https://www.altmetric.com/explorer/highlights?identifier=10568%2F89975
-- So it's likely the DSpace Altmetric badge code that is deciding not to show the badge
+- So it’s likely the DSpace Altmetric badge code that is deciding not to show the badge
diff --git a/docs/2019-04/index.html b/docs/2019-04/index.html
index b0b241e7d..cecf8024e 100644
--- a/docs/2019-04/index.html
+++ b/docs/2019-04/index.html
@@ -61,7 +61,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
"/>
@@ -91,7 +91,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
@@ -138,7 +138,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
April, 2019
@@ -169,7 +169,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
2019-04-02
- CTA says the Amazon IPs are AWS gateways for real user traffic
-- I was trying to add Felix Shaw's account back to the Administrators group on DSpace Test, but I couldn't find his name in the user search of the groups page
+- I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
- If I searched for “Felix” or “Shaw” I saw other matches, included one for his personal email address!
- I ended up finding him via searching for his email address
@@ -192,12 +192,12 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
- After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
-- One user's name has changed so I will update those using my fix-metadata-values.py script:
+- One user’s name has changed so I will update those using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
- I created a pull request and merged the changes to the 5_x-prod branch (#417)
-- A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it's still going:
+- A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:
2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
@@ -228,10 +228,10 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace

-- The other thing visible there is that the past few days the load has spiked to 500% and I don't think it's a coincidence that the Solr updating thing is happening…
+- The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…
- I ran all system updates and rebooted the server
-- The load was lower on the server after reboot, but Solr didn't come back up properly according to the Solr Admin UI:
+- The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:
@@ -241,7 +241,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
2019-04-06
-- Udana asked why item 10568/91278 didn't have an Altmetric badge on CGSpace, but on the WLE website it does
+- Udana asked why item 10568/91278 didn’t have an Altmetric badge on CGSpace, but on the WLE website it does
- I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
- I tweeted the item and I assume this will link the Handle with the DOI in the system
@@ -273,12 +273,12 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
4267 45.5.186.2
4893 205.186.128.185
-45.5.184.72 is in Colombia so it's probably CIAT, and I see they are indeed trying to get crawl the Discover pages on CIAT's datasets collection:
+45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
- Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
-- They made 22,000 requests to Discover on this collection today alone (and it's only 11AM):
+- They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
22077 /handle/10568/72970/discover
@@ -332,7 +332,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
}
}
-- Strangely I don't see many hits in 2019-04:
+- Strangely I don’t see many hits in 2019-04:
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
{
@@ -417,7 +417,7 @@ X-XSS-Protection: 1; mode=block
- So definitely the size of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
-- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
+- After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
@@ -426,7 +426,7 @@ X-XSS-Protection: 1; mode=block
- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
-- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:
+- Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:
@@ -434,7 +434,7 @@ X-XSS-Protection: 1; mode=block
- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are statistics_type:view… very weird
-- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
+- I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
- I will try to re-deploy the 5_x-dev branch and test again
@@ -465,7 +465,7 @@ X-XSS-Protection: 1; mode=block
}
- I confirmed the same on CGSpace itself after making one HEAD request
-- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
+- So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
@@ -482,12 +482,12 @@ X-XSS-Protection: 1; mode=block
- See: DS-3986
- See: DS-4020
- See: DS-3832
-- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been removed from MaxMind's server as of 2018-04-01
+- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been removed from MaxMind’s server as of 2018-04-01
- Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this
- UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check iostat 1 10 and I saw that CPU steal is around 10–30 percent right now…
-- The load average is super high right now, as I've noticed the last few times UptimeRobot said that CGSpace went down:
+- The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:
$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
@@ -532,7 +532,7 @@ X-XSS-Protection: 1; mode=block
2019-04-08
-- Start checking IITA's last round of batch uploads from March on DSpace Test (20193rd.xls)
+- Start checking IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
- Lots of problems with affiliations, I had to correct about sixty of them
- I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:
@@ -543,7 +543,7 @@ X-XSS-Protection: 1; mode=block
- After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
-- The matched values can be accessed with cell.recon.match.name, but some of the new values don't appear, perhaps because I edited the original cell values?
+- The matched values can be accessed with cell.recon.match.name, but some of the new values don’t appear, perhaps because I edited the original cell values?
- I ended up using this GREL expression to copy all values to a new column:
@@ -599,7 +599,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe

-- Linode Support still didn't respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
+- Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
- The web server logs are not very busy:
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
@@ -679,7 +679,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
$ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
- Otherwise, they provide the funder data in CSV and RDF format
-- I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn't match will need a human to go and do some manual checking and informed decision making…
+- I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…
- If I want to write a script for this I could use the Python habanero library:
from habanero import Crossref
@@ -687,7 +687,7 @@ cr = Crossref(mailto="me@cgiar.org")
x = cr.funders(query = "mercator")
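Continuing that habanero fragment, a slightly fuller sketch of a funder lookup; the message/items structure is my reading of the Crossref REST API rather than something verified here:

```
from habanero import Crossref

cr = Crossref(mailto="me@cgiar.org")
result = cr.funders(query="mercator")

# Crossref wraps results in message -> items (assumption; adjust if the
# structure differs in practice)
for funder in result.get("message", {}).get("items", []):
    print(funder.get("name"), funder.get("uri"))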
2019-04-11
-- Continue proofing IITA's last round of batch uploads from March on DSpace Test (20193rd.xls)
+- Continue proofing IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
- One misspelled country
- Three incorrect regions
@@ -711,7 +711,7 @@ x = cr.funders(query = "mercator")
-I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA's records, so I applied them to DSpace Test and CGSpace:
+I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
@@ -719,9 +719,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
Answer more questions about DOIs and Altmetric scores from WLE
Answer more questions about DOIs and Altmetric scores from IWMI
-- They can't seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
-- To make things worse, many of their items DON'T have DOIs, so when Altmetric harvests them of course there is no link! - Then, a bunch of their items don't have scores because they never tweeted them!
-- They added a DOI to this old item 10567/97087 this morning and wonder why Altmetric's score hasn't linked with the DOI magically
+- They can’t seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
+- To make things worse, many of their items DON’T have DOIs, so when Altmetric harvests them of course there is no link! - Then, a bunch of their items don’t have scores because they never tweeted them!
+- They added a DOI to this old item 10567/97087 this morning and wonder why Altmetric’s score hasn’t linked with the DOI magically
- We should check in a week to see if Altmetric will make the association after one week when they harvest again
@@ -734,7 +734,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
- It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear “sawtooth” pattern in the garbage collection
- I am curious if the GC pattern would be different if I switched from the
-XX:+UseConcMarkSweepGC
to G1GC
-- I switched to G1GC and restarted Tomcat but for some reason I couldn't see the Tomcat PID in VisualVM…
+- I switched to G1GC and restarted Tomcat but for some reason I couldn’t see the Tomcat PID in VisualVM…
- Anyways, the indexing process took much longer, perhaps twice as long!
@@ -771,10 +771,10 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
- Tag version 1.0.0 and deploy it on DSpace Test
-Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to iostat 1 10, though I haven't had any Linode alerts in a few days
-Abenet sent me a list of ILRI items that don't have CRPs added to them
+Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to iostat 1 10, though I haven’t had any Linode alerts in a few days
+Abenet sent me a list of ILRI items that don’t have CRPs added to them
-- The spreadsheet only had Handles (no IDs), so I'm experimenting with using Python in OpenRefine to get the IDs
+- The spreadsheet only had Handles (no IDs), so I’m experimenting with using Python in OpenRefine to get the IDs
- I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:
@@ -795,12 +795,12 @@ item_id = data['id']
return item_id
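For reference, the same Handle-to-ID transform as a standalone Python script using the DSpace 5 REST API's /rest/handle endpoint; the hostname is illustrative:

```
import requests

def handle_to_item_id(handle, base="https://cgspace.cgiar.org"):
    # the REST API returns the basic item record for a Handle, including its
    # internal database ID
    response = requests.get(f"{base}/rest/handle/{handle}", timeout=30)
    response.raise_for_status()
    return response.json()["id"]

print(handle_to_item_id("10568/89975"))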
-- Luckily none of the items already had CRPs, so I didn't have to worry about them getting removed
+- Luckily none of the items already had CRPs, so I didn’t have to worry about them getting removed
- It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs…
-- I ran a full Discovery indexing on CGSpace because I didn't do it after all the metadata updates last week:
+- I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
@@ -809,7 +809,7 @@ user 7m33.446s
sys 2m13.463s
2019-04-16
-- Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
+- Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
2019-04-17
@@ -914,8 +914,8 @@ sys 2m13.463s
- The biggest takeaway I have is that this workload benefits from a larger filterCache (for Solr fq parameter), but barely uses the queryResultCache (for Solr q parameter) at all
-- The number of hits goes up and the time taken decreases when we increase the filterCache, and total JVM heap memory doesn't seem to increase much at all
-- I guess the queryResultCache size is always 2 because I'm only doing two queries: type:0 and type:2 (downloads and views, respectively)
+- The number of hits goes up and the time taken decreases when we increase the filterCache, and total JVM heap memory doesn’t seem to increase much at all
+- I guess the queryResultCache size is always 2 because I’m only doing two queries: type:0 and type:2 (downloads and views, respectively)
- Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:
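While tuning the filterCache and queryResultCache sizes discussed above it helps to watch the hit ratios; a sketch that polls Solr's admin/mbeans handler, where the core URL matches the ones used elsewhere in these notes and the exact JSON layout of the 4.x response is an assumption:

```
import requests

SOLR = "http://localhost:8081/solr/statistics"  # core name as used on DSpace Test

response = requests.get(
    f"{SOLR}/admin/mbeans",
    params={"stats": "true", "cat": "CACHE", "wt": "json"},
    timeout=30,
)
response.raise_for_status()

# solr-mbeans comes back as a flat list alternating category name and payload
# (assumption about the 4.x JSON layout; adjust if it differs)
mbeans = response.json()["solr-mbeans"]
caches = dict(zip(mbeans[::2], mbeans[1::2])).get("CACHE", {})

for name in ("filterCache", "queryResultCache"):
    stats = caches.get(name, {}).get("stats", {})
    print(name, {k: stats.get(k) for k in ("lookups", "hits", "hitratio", "evictions")})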
@@ -959,7 +959,7 @@ sys 2m13.463s

2019-04-18
-- I've been trying to copy the statistics-2018 Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
+- I’ve been trying to copy the statistics-2018 Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
- I opened a support ticket to ask Linode to investigate
- They asked me to send an mtr report from Fremont to Frankfurt and vice versa
@@ -968,10 +968,10 @@ sys 2m13.463s
- Deploy Tomcat 7.0.94 on DSpace Test (linode19)
- Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of something with how Ansible formatted the options string
-- I needed to use the “folded” YAML variable format >- (with the dash so it doesn't add a return at the end)
+- I needed to use the “folded” YAML variable format >- (with the dash so it doesn’t add a return at the end)
-- UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with iostat 1 10 and it's in the 50s and 60s
+- UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with iostat 1 10 and it’s in the 50s and 60s
- The munin graph shows a lot of CPU steal (red) currently (and over all during the week):
@@ -1009,13 +1009,13 @@ TCP window size: 85.0 KByte (default)
[ 5] 0.0-10.2 sec 172 MBytes 142 Mbits/sec
[ 4] 0.0-10.5 sec 202 MBytes 162 Mbits/sec
-- Even with the software firewalls disabled the rsync speed was low, so it's not a rate limiting issue
+- Even with the software firewalls disabled the rsync speed was low, so it’s not a rate limiting issue
- I also tried to download a file over HTTPS from CGSpace to DSpace Test, but it was capped at 20KiB/sec
- I updated the Linode issue with this information
-- I'm going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode's latest x86_64
+- I’m going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode’s latest x86_64
- Nope, still 20KiB/sec
@@ -1026,7 +1026,7 @@ TCP window size: 85.0 KByte (default)
- Deploy Solr 4.10.4 on CGSpace (linode18)
- Deploy Tomcat 7.0.94 on CGSpace
- Deploy dspace-statistics-api v1.0.0 on CGSpace
-- Linode support replicated the results I had from the network speed testing and said they don't know why it's so slow
+- Linode support replicated the results I had from the network speed testing and said they don’t know why it’s so slow
- They offered to live migrate the instance to another host to see if that helps
@@ -1034,7 +1034,7 @@ TCP window size: 85.0 KByte (default)
2019-04-22
-- Abenet pointed out an item that doesn't have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
+- Abenet pointed out an item that doesn’t have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
- I tweeted the Handle to see if it will pick it up…
- Like clockwork, after fifteen minutes there was a donut showing on CGSpace
@@ -1062,7 +1062,7 @@ dspace.log.2019-04-20:1515
-- Perhaps that's why the Azure pricing is so expensive!
+- Perhaps that’s why the Azure pricing is so expensive!
- Add a privacy page to CGSpace
- The work was mostly similar to the About page at /page/about, but in addition to adding i18n strings etc, I had to add the logic for the trail to dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl
@@ -1086,7 +1086,7 @@ dspace.log.2019-04-20:1515
- While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (dc.identifier.uri)
-- According to my notes in 2018-09 I had noticed this when he uploaded the records and told him to remove them, but he didn't…
+- According to my notes in 2018-09 I had noticed this when he uploaded the records and told him to remove them, but he didn’t…
- I exported the IITA community as a CSV then used csvcut to extract the two URI columns and identify and fix the records:
@@ -1097,14 +1097,14 @@ dspace.log.2019-04-20:1515
- I told him we never finished it, and that he should try to use the /items/find-by-metadata-field endpoint, with the caveat that you need to match the language attribute exactly (ie “en”, “en_US”, null, etc)
- I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)
-- He says he's getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:
+- He says he’s getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:
$ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
curl: (22) The requested URL returned error: 401
-- Note that curl only shows the HTTP 401 error if you use -f (fail), and only then if you don't include -s
+- Note that curl only shows the HTTP 401 error if you use -f (fail), and only then if you don’t include -s
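A Python version of that find-by-metadata-field request which prints the status code and body instead of curl's terse -f failure; the URL and payload mirror the curl call above, and the assumption that matches come back as a JSON array of items is mine:

```
import requests

payload = {
    "key": "cg.subject.cpwf",
    "value": "WATER MANAGEMENT",
    "language": "en_US",
}

response = requests.post(
    "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field",
    json=payload,
    headers={"Accept": "application/json"},
    timeout=60,
)

print(response.status_code)
if response.ok:
    # each matching item should at least have a handle and a name
    for item in response.json():
        print(item.get("handle"), item.get("name"))
else:
    print(response.text)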
- I see there are about 1,000 items using CPWF subject “WATER MANAGEMENT” in the database, so there should definitely be results
- The breakdown of text_lang fields used in those items is 942:
@@ -1129,7 +1129,7 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
417
(1 row)
-- I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn't have permission to access… from the DSpace log:
+- I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:
2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
@@ -1209,7 +1209,7 @@ $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b"
COPY 65752
2019-04-28
-- Still trying to figure out the issue with the items that cause the REST API's /items/find-by-metadata-value endpoint to throw an exception
+- Still trying to figure out the issue with the items that cause the REST API’s /items/find-by-metadata-value endpoint to throw an exception
- I made the item private in the UI and then I see in the UI and PostgreSQL that it is no longer discoverable:
@@ -1234,7 +1234,7 @@ COPY 65752
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
-- Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I'll try to do a CSV
+- Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I’ll try to do a CSV
- In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:
diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html
index ef0c76b3b..c52fea0a5 100644
--- a/docs/2019-05/index.html
+++ b/docs/2019-05/index.html
@@ -45,7 +45,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present…
"/>
@@ -75,7 +75,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
@@ -122,7 +122,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
May, 2019
@@ -146,7 +146,7 @@ DELETE 1
- I managed to delete the problematic item from the database
-- First I deleted the item's bitstream in XMLUI and then ran dspace cleanup -v to remove it from the assetstore
+- First I deleted the item’s bitstream in XMLUI and then ran dspace cleanup -v to remove it from the assetstore
- Then I ran the following SQL:
@@ -155,7 +155,7 @@ DELETE 1
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
dspace=# DELETE FROM item WHERE item_id=74648;
-- Now the item is (hopefully) really gone and I can continue to troubleshoot the issue with REST API's /items/find-by-metadata-value endpoint
+- Now the item is (hopefully) really gone and I can continue to troubleshoot the issue with REST API’s /items/find-by-metadata-value endpoint
- Of course I run into another HTTP 401 error when I continue trying the LandPortal search from last month:
@@ -177,15 +177,15 @@ curl: (22) The requested URL returned error: 401 Unauthorized
- Some are in the workspaceitem table (pre-submission), others are in the workflowitem table (submitted), and others are actually approved, but withdrawn…
-- This is actually a worthless exercise because the real issue is that the /items/find-by-metadata-value endpoint is simply designed flawed and shouldn't be fatally erroring when the search returns items the user doesn't have permission to access
-- It would take way too much time to try to fix the fucked up items that are in limbo by deleting them in SQL, but also, it doesn't actually fix the problem because some items are submitted but withdrawn, so they actually have handles and everything
-- I think the solution is to recommend people don't use the /items/find-by-metadata-value endpoint
+- This is actually a worthless exercise because the real issue is that the /items/find-by-metadata-value endpoint is simply flawed by design and shouldn’t fatally error when the search returns items the user doesn’t have permission to access
+- It would take way too much time to try to fix the fucked up items that are in limbo by deleting them in SQL, but also, it doesn’t actually fix the problem because some items are submitted but withdrawn, so they actually have handles and everything
+- I think the solution is to recommend people don’t use the /items/find-by-metadata-value endpoint
- CIP is asking about embedding PDF thumbnail images in their RSS feeds again
-- They asked in 2018-09 as well and I told them it wasn't possible
-- To make sure, I looked at the documentation for RSS media feeds and tried it, but couldn't get it to work
+- They asked in 2018-09 as well and I told them it wasn’t possible
+- To make sure, I looked at the documentation for RSS media feeds and tried it, but couldn’t get it to work
- It seems to be geared towards iTunes and Podcasts… I dunno
@@ -273,7 +273,7 @@ Please see the DSpace documentation for assistance.


-- The number of unique sessions today is ridiculously high compared to the last few days considering it's only 12:30PM right now:
+- The number of unique sessions today is ridiculously high compared to the last few days considering it’s only 12:30PM right now:
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
101108
@@ -326,7 +326,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
2845 HEAD
98121 GET
-- I'm not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?
+- I’m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?
- Looking again, I see 84,000 requests to /handle this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in access.log):
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
@@ -413,7 +413,7 @@ Error sending email:
Please see the DSpace documentation for assistance.
- I checked the settings and apparently I had updated it incorrectly last week after ICT reset the password
-- Help Moayad with certbot-auto for Let's Encrypt scripts on the new AReS server (linode20)
+- Help Moayad with certbot-auto for Let’s Encrypt scripts on the new AReS server (linode20)
- Normalize all text_lang values for metadata on CGSpace and DSpace Test (as I had tested last month):
UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
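Before and after running that UPDATE it's useful to see the text_lang distribution; a sketch against the DSpace 5 metadatavalue table, with placeholder credentials:

```
import psycopg2

conn = psycopg2.connect(dbname="dspace", user="dspace", password="fuuu", host="localhost")
with conn.cursor() as cur:
    # count metadata values per text_lang to verify the normalization worked
    cur.execute(
        "SELECT text_lang, COUNT(*) FROM metadatavalue "
        "WHERE resource_type_id = 2 GROUP BY text_lang ORDER BY COUNT(*) DESC"
    )
    for text_lang, count in cur.fetchall():
        print(f"{count:8d}  {text_lang!r}")
conn.close()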
@@ -455,7 +455,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
- So this was definitely an attack of some sort… only God knows why
-- I noticed a few new bots that don't use the word “bot” in their user agent and therefore don't match Tomcat's Crawler Session Manager Valve:
+- I noticed a few new bots that don’t use the word “bot” in their user agent and therefore don’t match Tomcat’s Crawler Session Manager Valve:
Blackboard Safeassign
Unpaywall
@@ -486,7 +486,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
2019-05-15
-- Tezira says she's having issues with email reports for approved submissions, but I received an email about collection subscriptions this morning, and I tested with dspace test-email and it's also working…
+- Tezira says she’s having issues with email reports for approved submissions, but I received an email about collection subscriptions this morning, and I tested with dspace test-email and it’s also working…
- Send a list of DSpace build tips to Panagis from AgroKnow
- Finally fix the AReS v2 to work via DSpace Test and send it to Peter et al to give their feedback
@@ -501,7 +501,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
COPY 995
-- Fork the ICARDA AReS v1 repository to ILRI's GitHub and give access to CodeObia guys
+- Fork the ICARDA AReS v1 repository to ILRI’s GitHub and give access to CodeObia guys
- The plan is that we develop the v2 code here
@@ -522,7 +522,7 @@ $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
- I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically
- Instead, I exported a new list and asked Peter to look at it again
-- Apply Peter's new corrections on DSpace Test and CGSpace:
+- Apply Peter’s new corrections on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
@@ -581,7 +581,7 @@ COPY 64871
- Run all system updates on DSpace Test (linode19) and reboot it
- Paola from CIAT asked for a way to generate a report of the top keywords for each year of their articles and journals
-- I told them that the best way (even though it's low tech) is to work on a CSV dump of the collection
+- I told them that the best way (even though it’s low tech) is to work on a CSV dump of the collection
@@ -600,7 +600,7 @@ COPY 64871
2019-05-30 07:19:35,166 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
-- For now I just created an eperson with her personal email address until I have time to check LDAP to see what's up with her CGIAR account:
+- For now I just created an eperson with her personal email address until I have time to check LDAP to see what’s up with her CGIAR account:
$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
diff --git a/docs/2019-06/index.html b/docs/2019-06/index.html
index 1c43af46c..a49fde881 100644
--- a/docs/2019-06/index.html
+++ b/docs/2019-06/index.html
@@ -31,7 +31,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
-
+
@@ -61,7 +61,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
-
+
@@ -108,7 +108,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
June, 2019
@@ -172,16 +172,16 @@ Skype with Marie-Angélique and Abenet about CG Core v2
Create a new AReS repository: https://github.com/ilri/AReS
Start looking at the 203 IITA records on DSpace Test from last month (IITA_May_16 aka “20194th.xls”) using OpenRefine
-- Trim leading, trailing, and consecutive whitespace on all columns, but I didn't notice very many issues
+- Trim leading, trailing, and consecutive whitespace on all columns, but I didn’t notice very many issues
- Validate affiliations against latest list of top 1500 terms using reconcile-csv, correcting and standardizing about twenty-seven
- Validate countries against latest list of countries using reconcile-csv, correcting three
-- Convert all DOIs to “https://dx.doi.org" format
+- Convert all DOIs to “https://dx.doi.org” format
- Normalize all cg.identifier.url Google book fields to “books.google.com”
- Correct some inconsistencies in IITA subjects
- Correct two incorrect “Peer Review” in dc.description.version
- About fifteen items have incorrect ISBNs (looks like an Excel error because the values look like scientific numbers)
- Delete one blank item
-- I managed to get to subjects, so I'll continue from there when I start working next
+- I managed to get to subjects, so I’ll continue from there when I start working next
Generate a new list of countries from the database for use with reconcile-csv
@@ -194,7 +194,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
COPY 192
$ csvcut -l -c 0 /tmp/countries.csv > 2019-06-10-countries.csv
-- Get a list of all the unique AGROVOC subject terms in IITA's data and export it to a text file so I can validate them with my agrovoc-lookup.py script:
+- Get a list of all the unique AGROVOC subject terms in IITA’s data and export it to a text file so I can validate them with my agrovoc-lookup.py script:
$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
$ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
@@ -251,9 +251,9 @@ UPDATE 2
Lots of variation in affiliations, for example:
- Université Abomey-Calavi
-- Université d'Abomey
-- Université d'Abomey Calavi
-- Université d'Abomey-Calavi
+- Université d’Abomey
+- Université d’Abomey Calavi
+- Université d’Abomey-Calavi
- University of Abomey-Calavi
diff --git a/docs/2019-07/index.html b/docs/2019-07/index.html
index 07ab039ef..e6f68bb48 100644
--- a/docs/2019-07/index.html
+++ b/docs/2019-07/index.html
@@ -35,7 +35,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
-
+
@@ -65,7 +65,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
-
+
@@ -112,7 +112,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
July, 2019
@@ -129,16 +129,16 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
-- If I change the parameters to 2019 I see stats, so I'm really thinking it has something to do with the sharded yearly Solr statistics cores
+- If I change the parameters to 2019 I see stats, so I’m really thinking it has something to do with the sharded yearly Solr statistics cores
-- I checked the Solr admin UI and I see all Solr cores loaded, so I don't know what it could be
+- I checked the Solr admin UI and I see all Solr cores loaded, so I don’t know what it could be
- When I check the Atmire content and usage module it seems obvious that there is a problem with the old cores because I don't have anything before 2019-01

-- I don't see anyone logged in right now so I'm going to try to restart Tomcat and see if the stats are accessible after Solr comes back up
+- I don’t see anyone logged in right now so I’m going to try to restart Tomcat and see if the stats are accessible after Solr comes back up
- I decided to run all system updates on the server (linode18) and reboot it
- After rebooting, Tomcat came back up, but the Solr statistics cores were not all loaded
@@ -166,24 +166,24 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
# find /dspace/solr/statistics* -iname "*.lock" -print -delete
# systemctl start tomcat7
-- But it still didn't work!
+- But it still didn’t work!
- I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:
<lockType>${solr.lock.type:simple}</lockType>
-- And after restarting Tomcat it still doesn't work
-- Now I'll try going back to “native” locking with unlockAtStartup:
+- And after restarting Tomcat it still doesn’t work
+- Now I’ll try going back to “native” locking with unlockAtStartup:
<unlockOnStartup>true</unlockOnStartup>
-- Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can't access any stats before 2018
-- I filed an issue with Atmire, so let's see if they can help
-- And since I'm annoyed and it's been a few months, I'm going to move the JVM heap settings that I've been testing on DSpace Test to CGSpace
+- Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018
+- I filed an issue with Atmire, so let’s see if they can help
+- And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace
- The old ones were:
-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
-- And the new ones come from Solr 4.10.x's startup scripts:
+- And the new ones come from Solr 4.10.x’s startup scripts:
-Djava.awt.headless=true
-Xms8192m -Xmx8192m
@@ -253,7 +253,7 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
"Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
"Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
-- But when I ran fix-metadata-values.py I didn't see any changes:
+- But when I ran fix-metadata-values.py I didn’t see any changes:
$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
2019-07-06
@@ -328,7 +328,7 @@ dc.identifier.issn
2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
-- I'm assuming something happened in his browser (like a refresh) after the item was submitted…
+- I’m assuming something happened in his browser (like a refresh) after the item was submitted…
2019-07-12
@@ -336,7 +336,7 @@ dc.identifier.issn
- Unfortunately there is no concrete feedback yet
- I think we need to upgrade our DSpace Test server so we can fit all the Solr cores…
-- Actually, I looked and there were over 40 GB free on DSpace Test so I copied the Solr statistics cores for the years 2017 to 2010 from CGSpace to DSpace Test because they weren't actually very large
+- Actually, I looked and there were over 40 GB free on DSpace Test so I copied the Solr statistics cores for the years 2017 to 2010 from CGSpace to DSpace Test because they weren’t actually very large
- I re-deployed DSpace for good measure, and I think all Solr cores are loading… I will do more tests later
@@ -353,7 +353,7 @@ $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bits
UPDATE 1
2019-07-16
-- Completely reset the Podman configuration on my laptop because there were some layers that I couldn't delete and it had been some time since I did a cleanup:
+- Completely reset the Podman configuration on my laptop because there were some layers that I couldn’t delete and it had been some time since I did a cleanup:
$ podman system prune -a -f --volumes
$ sudo rm -rf ~/.local/share/containers
@@ -376,7 +376,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
- Talk to Moayad about the remaining issues for OpenRXV / AReS
-- He sent a pull request with some changes for the bar chart and documentation about configuration, and said he'd finish the export feature next week
+- He sent a pull request with some changes for the bar chart and documentation about configuration, and said he’d finish the export feature next week
- Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:
@@ -399,13 +399,13 @@ Please see the DSpace documentation for assistance.
- ICT reset the password for the CGSpace support account and apparently removed the expiry requirement
-- I tested the account and it's working
+- I tested the account and it’s working
2019-07-20
-- Create an account for Lionelle Samnick on CGSpace because the registration isn't working for some reason:
+- Create an account for Lionelle Samnick on CGSpace because the registration isn’t working for some reason:
$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
@@ -413,12 +413,12 @@ Please see the DSpace documentation for assistance.
- Start looking at 1429 records for the Bioversity batch import
- Multiple authors should be specified with multi-value separator (||) instead of ;
-- We don't use “(eds)” as an author
+- We don’t use “(eds)” as an author
- Same issue with dc.publisher using “;” for multiple values
- Some invalid ISSNs in dc.identifier.issn (they look like ISBNs)
- I see some ISSNs in the dc.identifier.isbn field
- I see some invalid ISBNs that look like Excel errors (9,78E+12)
-- For DOI we just use the URL, not “DOI: https://doi.org..."
+- For DOI we just use the URL, not “DOI: https://doi.org…”
- I see an invalid “LEAVE BLANK” in the cg.contributor.crp field
- Country field is using “,” for multiple values instead of “||”
- Region field is using “,” for multiple values instead of “||”
@@ -462,7 +462,7 @@ Please see the DSpace documentation for assistance.
- A few strange publishers after splitting multi-value cells, like “(Belgium)”
- Deleted four ISSNs that are actually ISBNs and are already present in the ISBN field
- Eight invalid ISBNs
-- Convert all DOIs to “https://doi.org" format and fix one invalid DOI
+- Convert all DOIs to “https://doi.org” format and fix one invalid DOI
- Fix a handful of incorrect CRPs that seem to have been split on comma “,”
- Lots of strange values in cg.link.reference, and I normalized all DOIs to https://doi.org format
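A hedged sketch of the DOI normalization described above, not the actual cleanup used on these records (assumes a plain text file with one DOI string per line; bare “10.xxxx/…” DOIs would still need a separate pass):

```
$ sed -E 's|^(DOI: *)?(https?://)?(dx\.)?doi\.org/|https://doi.org/|' /tmp/dois.txt
```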
diff --git a/docs/2019-08/index.html b/docs/2019-08/index.html
index a8767f4ab..025a9e025 100644
--- a/docs/2019-08/index.html
+++ b/docs/2019-08/index.html
@@ -8,7 +8,7 @@
-
+
@@ -73,7 +73,7 @@ Run system updates on DSpace Test (linode19) and reboot it
-
+
@@ -120,14 +120,14 @@ Run system updates on DSpace Test (linode19) and reboot it
August, 2019
2019-08-03
-- Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
+- Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
2019-08-04
@@ -135,7 +135,7 @@ Run system updates on DSpace Test (linode19) and reboot it
- Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
-- After rebooting, all statistics cores were loaded… wow, that's lucky.
+- After rebooting, all statistics cores were loaded… wow, that’s lucky.
- Run system updates on DSpace Test (linode19) and reboot it
@@ -199,7 +199,7 @@ Run system updates on DSpace Test (linode19) and reboot it
isNotNull(value.match(/^.*û.*$/))
).toString()
-- I tried to extract the filenames and construct a URL to download the PDFs with my generate-thumbnails.py script, but there seem to be several paths for PDFs so I can't guess it properly
+- I tried to extract the filenames and construct a URL to download the PDFs with my generate-thumbnails.py script, but there seem to be several paths for PDFs so I can’t guess it properly
- I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test
2019-08-06
@@ -231,7 +231,7 @@ Run system updates on DSpace Test (linode19) and reboot it
# /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
- It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains
-- Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04's OpenSSL 1.1.0g with nginx 1.16.0
+- Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04’s OpenSSL 1.1.0g with nginx 1.16.0
- Run all system updates on AReS dev server (linode20) and reboot it
- Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:
@@ -253,7 +253,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
-Even so, there are still 52 items with incorrect filenames, so I can't derive their PDF URLs…
+Even so, there are still 52 items with incorrect filenames, so I can’t derive their PDF URLs…
- For example, Wild_cherry_Prunus_avium_859.pdf is here (with double underscore): https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf
@@ -348,7 +348,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
- I imported the 1,427 Bioversity records into DSpace Test
-- To make sure we didn't have memory issues I reduced Tomcat's JVM heap by 512m, increased the import process's heap to 512m, and split the input file into two parts with about 700 each
+- To make sure we didn’t have memory issues I reduced Tomcat’s JVM heap by 512m, increased the import process’s heap to 512m, and split the input file into two parts with about 700 each
- Then I had to create a few new temporary collections on DSpace Test that had been created on CGSpace after our last sync
- After that the import succeeded:
@@ -395,8 +395,8 @@ return os.path.basename(value)
2019-08-21
-- Upload csv-metadata-quality repository to ILRI's GitHub organization
-- Fix a few invalid countries in IITA's July 29 records (aka “20195TH.xls”)
+- Upload csv-metadata-quality repository to ILRI’s GitHub organization
+- Fix a few invalid countries in IITA’s July 29 records (aka “20195TH.xls”)
- These were not caught by my csv-metadata-quality check script because of a logic error
- Remove dc.identifier.uri fields from test data, set id values to “-1”, add collection mappings according to dc.type, and upload 126 IITA records to CGSpace
@@ -439,13 +439,13 @@ sys 2m24.715s
- Peter asked me to add related citation aka cg.link.citation to the item view
-- I created a pull request with a draft implementation and asked for Peter's feedback
+- I created a pull request with a draft implementation and asked for Peter’s feedback
- Add the ability to skip certain fields from the csv-metadata-quality script using --exclude-fields
-- For example, when I'm working on the author corrections I want to do the basic checks on the corrected fields, but not on the original fields, so I would use --exclude-fields dc.contributor.author for example
+- For example, when I’m working on the author corrections I want to do the basic checks on the corrected fields, but not on the original fields, so I would use --exclude-fields dc.contributor.author for example
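A possible invocation combining the flags mentioned in these notes (the file names are placeholders):

```
$ csv-metadata-quality -i /tmp/2019-08-author-corrections.csv -o /tmp/2019-08-author-corrections-cleaned.csv --exclude-fields 'dc.contributor.author'
```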
@@ -493,7 +493,7 @@ COPY 65597
- Resume working on the CG Core v2 changes in the 5_x-cgcorev2 branch again
-- I notice that CG Core doesn't currently have a field for CGSpace's “alternative title” (dc.title.alternative), but DCTERMS has dcterms.alternative so I raised an issue about adding it
+- I notice that CG Core doesn’t currently have a field for CGSpace’s “alternative title” (dc.title.alternative), but DCTERMS has dcterms.alternative so I raised an issue about adding it
- Marie responded and said she would add dcterms.alternative
- I created a sed script file to perform some replacements of metadata on the XMLUI XSL files:
@@ -521,7 +521,7 @@ COPY 65597
"handles":["10986/30568","10568/97825"],"handle":"10986/30568"
-- So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn't show it because it seems to be a secondary handle or something
+- So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn’t show it because it seems to be a secondary handle or something
2019-08-31
diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html
index 18871b4a0..16fce6aab 100644
--- a/docs/2019-09/index.html
+++ b/docs/2019-09/index.html
@@ -69,7 +69,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
-
+
@@ -99,7 +99,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
-
+
@@ -146,7 +146,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
September, 2019
@@ -197,7 +197,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
2350 discover
71 handle
-- I'm not sure why the outbound traffic rate was so high…
+- I’m not sure why the outbound traffic rate was so high…
2019-09-02
@@ -304,7 +304,7 @@ dspace.log.2019-09-15:808
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore"
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
-- I restarted Tomcat and the item views came back, but then the Solr statistics cores didn't all load properly
+- I restarted Tomcat and the item views came back, but then the Solr statistics cores didn’t all load properly
- After restarting Tomcat once again, both the item views and the Solr statistics cores all came back OK
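One way to check that all of the sharded statistics cores actually loaded after such a restart is to ask Solr’s CoreAdmin API directly (a sketch; the port matches the Solr URLs used elsewhere in these notes). Each statistics-XXXX core should appear in the output, and a missing year is a sign that core failed to load:

```
$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | jq -r '.status | keys[]'
```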
@@ -312,7 +312,7 @@ dspace.log.2019-09-15:808
2019-09-19
-- For some reason my podman PostgreSQL container isn't working so I had to use Docker to re-create it for my testing work today:
+- For some reason my podman PostgreSQL container isn’t working so I had to use Docker to re-create it for my testing work today:
# docker pull docker.io/library/postgres:9.6-alpine
# docker volume create dspacedb_data
@@ -357,14 +357,14 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
- I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update
- Update the PostgreSQL JDBC driver to version 42.2.8 in our Ansible infrastructure scripts
-- There is only one minor fix to a usecase we aren't using so I will deploy this on the servers the next time I do updates
+- There is only one minor fix to a usecase we aren’t using so I will deploy this on the servers the next time I do updates
- Run system updates on DSpace Test (linode19) and reboot it
-- Start looking at IITA's latest round of batch updates that Sisay had uploaded to DSpace Test earlier this month
+- Start looking at IITA’s latest round of batch updates that Sisay had uploaded to DSpace Test earlier this month
-- For posterity, IITA's original input file was 20196th.xls and Sisay uploaded it as “IITA_Sep_06” to DSpace Test
-- Sisay said he did run the csv-metadata-quality script on the records, but I assume he didn't run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields
+- For posterity, IITA’s original input file was 20196th.xls and Sisay uploaded it as “IITA_Sep_06” to DSpace Test
+- Sisay said he did run the csv-metadata-quality script on the records, but I assume he didn’t run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields
- In addition, a few records were missing authorship type
- I deleted two invalid AGROVOC terms because they were ambiguous
- Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
@@ -391,19 +391,19 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
- I created and merged a pull request for the updates
-- This is the first time we've updated this controlled vocabulary since 2018-09
+- This is the first time we’ve updated this controlled vocabulary since 2018-09
2019-09-20
-- Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
+- Deploy a fresh snapshot of CGSpace’s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
- Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
- They want to do some enrichment of the metadata to add countries and regions
- Also, they noticed that some items have a blank ISSN in the citation like “ISSN:”
-- I told them it's probably best if we have Francesco produce a new export from Typo 3
-- But on second thought I think that I've already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs
+- I told them it’s probably best if we have Francesco produce a new export from Typo 3
+- But on second thought I think that I’ve already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs
- Other corrections would be to replace “Inst.” and “Instit.” with “Institute” and remove those blank ISSNs from the citations
- I will rename the files with multiple underscores so they match the filename column in the CSV using this command:
@@ -415,14 +415,14 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
- There are a few dozen that have completely fucked up names due to some encoding error
- To make matters worse, when I tried to download them, some of the links in the “URL” column that Francesco included are wrong, so I had to go to the permalink and get a link that worked
-- After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:
+- After downloading everything I had to use Ubuntu’s version of rename to get rid of all the double and triple underscores:
$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
-- I'm still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
+- I’m still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)
- I wrote two fairly long GREL expressions to clean up the institutional author names in the dc.contributor.author and dc.identifier.citation fields using OpenRefine
- The first targets acronyms in parentheses like “International Livestock Research Institute (ILRI)":
@@ -469,14 +469,14 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
- Play with language identification using the langdetect, fasttext, polyglot, and langid libraries
- ployglot requires too many system things to compile
-- langdetect didn't seem as accurate as the others
+- langdetect didn’t seem as accurate as the others
- fasttext is likely the best, but prints a blank link to the console when loading a model
- langid seems to be the best considering the above experiences
- I added very experimental language detection to the csv-metadata-quality module
-- It works by checking the predicted language of the dc.title field against the item's dc.language.iso field
+- It works by checking the predicted language of the dc.title field against the item’s dc.language.iso field
- I tested it on the Bioversity migration data set and it actually helped me correct eleven language fields in their records!
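A minimal sketch of the langid check described above (the title string is only an example, and the langid module must be installed, e.g. with pip):

```
$ python3 -c "import langid; print(langid.classify('Estrategias de adaptación al cambio climático'))"
```

langid.classify() returns a (language, score) tuple, so the predicted code can be compared against the item’s dc.language.iso value.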
@@ -504,7 +504,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
- I deleted another item that I had previously identified as a duplicate that she had fixed by incorrectly deleting the original (ugh)
-- Get a list of institutions from CCAFS's Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
+- Get a list of institutions from CCAFS’s Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
@@ -516,8 +516,8 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
- Skype with Peter and Abenet about CGSpace actions
-- Peter will respond to ICARDA's request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc
-- We discussed using ISO 3166 for countries, though Peter doesn't like the formal names like “Moldova, Republic of” and “Tanzania, United Republic of”
+- Peter will respond to ICARDA’s request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc
+- We discussed using ISO 3166 for countries, though Peter doesn’t like the formal names like “Moldova, Republic of” and “Tanzania, United Republic of”
- The Debian iso-codes package has ISO 3166-1 with “common name”, “name”, and “official name” representations, for example:
@@ -528,14 +528,14 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
- There are still some unfortunate ones there, though:
-- name: Korea, Democratic People's Republic of
-- official_name: Democratic People's Republic of Korea
+- name: Korea, Democratic People’s Republic of
+- official_name: Democratic People’s Republic of Korea
-- And this, which isn't even in English…
+- And this, which isn’t even in English…
-- name: Côte d'Ivoire
-- official_name: Republic of Côte d'Ivoire
+- name: Côte d’Ivoire
+- official_name: Republic of Côte d’Ivoire
- The other alternative is to just keep using the names we have, which are mostly compliant with AGROVOC
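To see the different representations for a given country, the JSON data shipped by the iso-codes package can be queried with jq; the file path below is an assumption based on where the package normally installs its data:

```
$ jq '."3166-1"[] | select(.alpha_2 == "TZ") | {name, official_name, common_name}' /usr/share/iso-codes/json/iso_3166-1.json
```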
diff --git a/docs/2019-10/index.html b/docs/2019-10/index.html
index 4a12b3521..3d68756da 100644
--- a/docs/2019-10/index.html
+++ b/docs/2019-10/index.html
@@ -6,7 +6,7 @@
-
+
@@ -14,8 +14,8 @@
-
-
+
+
@@ -45,7 +45,7 @@
-
+
@@ -92,7 +92,7 @@
October, 2019
@@ -102,15 +102,15 @@
- Udana from IWMI asked me for a CSV export of their community on CGSpace
- I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data
-- I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unnecessary Unicode” fix:
+- I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix:
$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
-- Then I replace them in vim with :% s/\%u00a0/ /g because I can't figure out the correct sed syntax to do it directly from the pipe above
+- Then I replace them in vim with :% s/\%u00a0/ /g because I can’t figure out the correct sed syntax to do it directly from the pipe above
- I uploaded those to CGSpace and then re-exported the metadata
-- Now that I think about it, I shouldn't be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!
+- Now that I think about it, I shouldn’t be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!
- I modified the script so it replaces the non-breaking spaces instead of removing them
- Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):
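As an aside, the non-breaking space replacement mentioned above can also be done directly in a pipe; this is a hedged sketch rather than the command actually used here, and the output file name is a placeholder:

```
$ perl -CSD -pe 's/\x{00A0}/ /g' /tmp/iwmi-title-region-subregion-river.csv > /tmp/iwmi-fixed.csv
# GNU sed alternative, matching the raw UTF-8 bytes of U+00A0:
$ sed 's/\xc2\xa0/ /g' /tmp/iwmi-title-region-subregion-river.csv > /tmp/iwmi-fixed.csv
```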
@@ -125,7 +125,7 @@
2019-10-04
-- Create an account for Bioversity's ICT consultant Francesco on DSpace Test:
+- Create an account for Bioversity’s ICT consultant Francesco on DSpace Test:
$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
@@ -162,7 +162,7 @@
- Start looking at duplicates in the Bioversity migration data on DSpace Test
-- I'm keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity
+- I’m keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity
@@ -181,7 +181,7 @@
- Felix Shaw from Earlham emailed me to ask about his admin account on DSpace Test
-- His old one got lost when I re-sync'd DSpace Test with CGSpace a few weeks ago
+- His old one got lost when I re-sync’d DSpace Test with CGSpace a few weeks ago
- I added a new account for him and added it to the Administrators group:
@@ -206,7 +206,7 @@ UPDATE 1
- More work on identifying duplicates in the Bioversity migration data on DSpace Test
- I mapped twenty-five more items on CGSpace and deleted them from the migration test collection on DSpace Test
-- After a few hours I think I finished all the duplicates that were identified by Atmire's Duplicate Checker module
+- After a few hours I think I finished all the duplicates that were identified by Atmire’s Duplicate Checker module
- According to my spreadsheet there were fifty-two in total
@@ -234,8 +234,8 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
- I would still like to perhaps (re)move institutional authors from dc.contributor.author to cg.contributor.affiliation, but I will have to run that by Francesca, Carol, and Abenet
- I could use a custom text facet like this in OpenRefine to find authors that likely match the “Last, F.” pattern:
isNotNull(value.match(/^.*, \p{Lu}\.?.*$/))
- The \p{Lu} is a cool regex character class to make sure this works for letters with accents
-- As cool as that is, it's actually more effective to just search for authors that have “.” in them!
-- I've decided to add a cg.contributor.affiliation column to 1,025 items based on the logic above where the author name is not an actual person
+- As cool as that is, it’s actually more effective to just search for authors that have “.” in them!
+- I’ve decided to add a cg.contributor.affiliation column to 1,025 items based on the logic above where the author name is not an actual person
@@ -279,7 +279,7 @@ real 82m35.993s
10568/129
(1 row)
-- So I'm still not sure where these weird authors in the “Top Author” stats are coming from
+- So I’m still not sure where these weird authors in the “Top Author” stats are coming from
2019-10-14
@@ -302,12 +302,12 @@ $ mkdir 2019-10-15-Bioversity
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
$ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
-- It's really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items
+- It’s really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items
- Then I imported a test subset of them in my local test environment:
$ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
-- I had forgotten (again) that the dspace export command doesn't preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
+- I had forgotten (again) that the dspace export command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
- On CGSpace I will increase the RAM of the command line Java process for good luck before import…
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
@@ -338,8 +338,8 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
Move the CGSpace CG Core v2 notes from a GitHub Gist to a page on this site for archive and searchability sake
Work on the CG Core v2 implementation testing
-- I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn't figure out why
-- It seems to be because the dc.title→dcterms.title modifications cause the title metadata to disappear from DRI's <pageMeta> and therefore the title is not accessible to the XSL transformation
+- I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn’t figure out why
+- It seems to be because the dc.title→dcterms.title modifications cause the title metadata to disappear from DRI’s <pageMeta> and therefore the title is not accessible to the XSL transformation
- Also, I noticed a few places in the Java code where dc.title is hard coded so I think this might be one of the fields that we just assume DSpace relies on internally
- I will revert all changes to dc.title and dc.title.alternative
- TODO: there are similar issues with the citation_author metadata element missing from DRI, so I might have to revert those changes too
diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html
index f25c0ab76..bcf8ff89d 100644
--- a/docs/2019-11/index.html
+++ b/docs/2019-11/index.html
@@ -20,7 +20,7 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
-Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -48,14 +48,14 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
-Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
"/>
-
+
@@ -85,7 +85,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
-
+
@@ -132,7 +132,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
November, 2019
@@ -151,7 +151,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
1277694
- So 4.6 million from XMLUI and another 1.2 million from API requests
-- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+- Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -173,7 +173,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
-- Their user agent is one I've never seen before:
+- Their user agent is one I’ve never seen before:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
@@ -196,7 +196,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
-- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire's COUNTER-Robots project
+ - On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire’s COUNTER-Robots project
- First I checked for a user agent that is in COUNTER-Robots, but NOT in the current dspace/config/spiders/example list
- Then I made some item and bitstream requests on DSpace Test using that user agent:
@@ -215,25 +215,25 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
</response>
-- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
+- Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
-- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list…
+- After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
-- What's strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn't exist…
+- What’s strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn’t exist…
spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
-- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file…
+- Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…
- I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr
-- Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace's “mark spiders” feature to change them to “isBot:true” in Solr
-- I restarted Tomcat and ran dspace stats-util -m and it did some stuff for awhile, but I still don't see any items in Solr with isBot:true
+- Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr
+- I restarted Tomcat and ran dspace stats-util -m and it did some stuff for awhile, but I still don’t see any items in Solr with isBot:true
- According to dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java the patterns for user agents are loaded from any file in the config/spiders/agents directory
- I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran dspace stats-util -m and still there were no new items marked as being bots in Solr, so I think there is still something wrong
- Jesus, the code in ./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java says that stats-util -m marks spider requests by their IPs, not by their user agents… WTF:
@@ -267,17 +267,17 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
<result name="response" numFound="0" start="0"/>
-- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
+- So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
- Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
-- I'm curious how the special character matching is in Solr, so I will test two requests: one with “www.gnip.com" which is in the spider list, and one with “www.gnyp.com" which isn't:
+- I’m curious how the special character matching is in Solr, so I will test two requests: one with “www.gnip.com” which is in the spider list, and one with “www.gnyp.com” which isn’t:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
-- Then commit changes to Solr so we don't have to wait:
+- Then commit changes to Solr so we don’t have to wait:
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
@@ -352,7 +352,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</lst>
</lst>
-- That answers Peter's question about why the stats jumped in October…
+- That answers Peter’s question about why the stats jumped in October…
2019-11-08
@@ -409,12 +409,12 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
2019-11-13
-- The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn't linked with the DOI's score
+- The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn’t linked with the DOI’s score
- I tweeted it again with the Handle and the DOI
-- Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr's regex search can't use those
+- Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr’s regex search can’t use those
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
$ http "http://localhost:8081/solr/statistics/update?commit=true"
@@ -424,19 +424,19 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
<result name="response" numFound="1" start="0">
- Nice, so searching with regex in Solr with // syntax works for those digits!
-- I realized that it's easier to search Solr from curl via POST using this syntax:
+- I realized that it’s easier to search Solr from curl via POST using this syntax:
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
- If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
-- You can disable this using the -g option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr's regex search:
+- You can disable this using the -g option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr’s regex search:
$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
-- I updated the check-spider-hits.sh script to use the POST syntax, and I'm evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
+- I updated the check-spider-hits.sh script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
2019-11-14
@@ -456,14 +456,14 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
- Greatly improve my check-spider-hits.sh script to handle regular expressions in the spider agents patterns file
- This allows me to detect and purge many more hits from the Solr statistics core
-- I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores
+- I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores
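For reference, the general shape of such a purge is a count with Solr’s select handler followed by a delete-by-query on the userAgent field; this is a hedged sketch, not the actual check-spider-hits.sh code, and the pattern is only an example taken from earlier in these notes:

```
$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Scrapoo\/[0-9]/&rows=0'
$ curl -s 'http://localhost:8081/solr/statistics/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>userAgent:/Scrapoo\/[0-9]/</query></delete>'
```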
2019-11-15
-- Run the new version of check-spider-hits.sh on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong
-- But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like:
+- Run the new version of check-spider-hits.sh on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong
+- But then I noticed that some (all?) of the hits weren’t actually getting purged, all of which were using regular expressions like:
MetaURI[\+\s]API\/[0-9]\.[0-9]
FDM(\s|\+)[0-9]
@@ -474,10 +474,10 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
- Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
-- Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
+- Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
- I tried to do URL encoding of the +, double escaping, etc… but nothing worked
-- I'm going to ignore regular expressions that have pluses for now
+- I’m going to ignore regular expressions that have pluses for now
- I think I might also have to ignore patterns that have percent signs, like ^\%?default\%?$
@@ -495,7 +495,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
- statistics: 1043373
-- That's 1.4 million hits in addition to the 2 million I purged earlier this week…
+- That’s 1.4 million hits in addition to the 2 million I purged earlier this week…
- For posterity, the major contributors to the hits on the statistics core were:
- Purging 812429 hits from curl/ in statistics
@@ -512,7 +512,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
2019-11-17
-- Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE's collection) was added recently and might have not been in the last harvesting yet
+- Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might have not been in the last harvesting yet
- I told her no, that the department is several years old, and the item was added in 2017
- Then I looked again at the dashboard for each department and I see the item in both departments now… shit.
@@ -538,7 +538,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
2019-11-19
-- Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
+- Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
- I had previously sent them an export in 2019-04
@@ -555,15 +555,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
- Found 4429 hits from ^User-Agent in statistics-2016
-- Buck is one I've never heard of before, its user agent is:
+- Buck is one I’ve never heard of before, its user agent is:
Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
-- All in all that's about 85,000 more hits purged, in addition to the 3.4 million I purged last week
+- All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week
2019-11-20
-- Email Usman Muchlish from CIFOR to see what he's doing with their DSpace lately
+- Email Usman Muchlish from CIFOR to see what he’s doing with their DSpace lately
2019-11-21
@@ -599,8 +599,8 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
- I rebooted DSpace Test (linode19) and it kernel panicked at boot
-- I looked on the console and saw that it can't mount the root filesystem
-- I switched the boot configuration to use the OS's kernel via GRUB2 instead of Linode's kernel and then it came up after reboot…
+- I looked on the console and saw that it can’t mount the root filesystem
+- I switched the boot configuration to use the OS’s kernel via GRUB2 instead of Linode’s kernel and then it came up after reboot…
- I initiated a migration of the server from the Fremont, CA region to Frankfurt, DE
- The migration is going very slowly, so I assume the network issues from earlier this year are still not fixed
diff --git a/docs/2019-12/index.html b/docs/2019-12/index.html
index 5adabde30..5b2934fd0 100644
--- a/docs/2019-12/index.html
+++ b/docs/2019-12/index.html
@@ -43,7 +43,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
-
+
@@ -73,7 +73,7 @@ Make sure all packages are up to date and the package manager is up to date, the
-
+
@@ -120,7 +120,7 @@ Make sure all packages are up to date and the package manager is up to date, the
December, 2019
@@ -159,13 +159,13 @@ Make sure all packages are up to date and the package manager is up to date, the
# apt install 'nginx=1.16.1-1~bionic'
# reboot
-- After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it's working:
+- After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it’s working:
# rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
# rm -rf /opt/ilri/dspace-statistics-api/venv
# /opt/certbot-auto
-- Clear Ansible's fact cache and re-run the playbooks to update the system's firewalls, SSH config, etc
+- Clear Ansible’s fact cache and re-run the playbooks to update the system’s firewalls, SSH config, etc
- Altmetric finally responded to my question about Dublin Core fields
- They shared a list of fields they use for tracking, but it only mentions HTML meta tags, and not fields considered when harvesting via OAI
@@ -191,8 +191,8 @@ Make sure all packages are up to date and the package manager is up to date, the
$ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/dspacetest-104030.xml
-- The DSpace Test ones actually now capture the DOI, where the CGSpace doesn't…
-- And the DSpace Test one doesn't include review status as dc.description, but I don't think that's an important field
+- The DSpace Test ones actually now capture the DOI, where the CGSpace doesn’t…
+- And the DSpace Test one doesn’t include review status as dc.description, but I don’t think that’s an important field
2019-12-04
@@ -219,7 +219,7 @@ COPY 48
- Enrico noticed that the AReS Explorer on CGSpace (linode18) was down
-- I only see HTTP 502 in the nginx logs on CGSpace… so I assume it's something wrong with the AReS server
+- I only see HTTP 502 in the nginx logs on CGSpace… so I assume it’s something wrong with the AReS server
- I ran all system updates on the AReS server (linode20) and rebooted it
- After rebooting the Explorer was accessible again
@@ -242,11 +242,11 @@ COPY 48
- Post message to Yammer about good practices for thumbnails on CGSpace
-- On the topic of thumbnails, I'm thinking we might want to force regenerate all PDF thumbnails on CGSpace since we upgraded it to Ubuntu 18.04 and got a new ghostscript…
+- On the topic of thumbnails, I’m thinking we might want to force regenerate all PDF thumbnails on CGSpace since we upgraded it to Ubuntu 18.04 and got a new ghostscript…
- More discussion about report formats for AReS
-- Peter noticed that the Atmire reports weren't showing any statistics before 2019
+- Peter noticed that the Atmire reports weren’t showing any statistics before 2019
- I checked and indeed Solr had an issue loading some core last time it was started
- I restarted Tomcat three times before all cores came up successfully
@@ -278,7 +278,7 @@ COPY 48
- I created an issue for “extended” text reports on the AReS GitHub (#9)
-- I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn't support images
+- I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn’t support images
- Export a list of all investors (dc.description.sponsorship) for Peter to look through and correct:
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
@@ -310,7 +310,7 @@ UPDATE 2
- Add three new CCAFS Phase II project tags to CGSpace (#441)
- Linode said DSpace Test (linode19) had an outbound traffic rate of 73Mb/sec for the last two hours
-- I see some Russian bot active in nginx's access logs:
+- I see some Russian bot active in nginx’s access logs:
@@ -349,7 +349,7 @@ UPDATE 1
- DCTERMS says that dcterms.audience should be used to describe “A class of entity for whom the resource is intended or useful.”
- I will update my notes for this so that we use that field instead
-- I don't see “audience” on the cg-core repository so I filed an issue to raise it with Marie-Angelique
+- I don’t see “audience” on the cg-core repository so I filed an issue to raise it with Marie-Angelique
@@ -357,7 +357,7 @@ UPDATE 1
- Follow up with Altmetric on the issue where an item has a different (lower) score for its Handle despite it having a correct DOI (with a higher score)
-- I've raised this issue three times to Altmetric this year, and a few weeks ago they said they would re-process the item “before Christmas”
+- I’ve raised this issue three times to Altmetric this year, and a few weeks ago they said they would re-process the item “before Christmas”
- Abenet suggested we use
cg.reviewStatus
instead of cg.review-status
and I agree that we should follow other examples like DCTERMS.accessRights
and DCTERMS.isPartOf
@@ -370,7 +370,7 @@ UPDATE 1
- Altmetric responded a few days ago about the item that has a different (lower) score for its Handle despite it having a correct DOI (with a higher score)
-- She tweeted the repository link and agreed that it didn't get picked up by Altmetric
+- She tweeted the repository link and agreed that it didn’t get picked up by Altmetric
- She said she will add this to the existing ticket about the previous items I had raised an issue about
diff --git a/docs/2020-01/index.html b/docs/2020-01/index.html
index 6c1a3e0fb..6087a5de4 100644
--- a/docs/2020-01/index.html
+++ b/docs/2020-01/index.html
@@ -53,7 +53,7 @@ I tweeted the CGSpace repository link
"/>
-
+
@@ -63,7 +63,7 @@ I tweeted the CGSpace repository link
"@type": "BlogPosting",
"headline": "January, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-01\/",
- "wordCount": "2117",
+ "wordCount": "2754",
"datePublished": "2020-01-06T10:48:30+02:00",
"dateModified": "2020-01-23T15:56:46+02:00",
"author": {
@@ -83,7 +83,7 @@ I tweeted the CGSpace repository link
-
+
@@ -130,7 +130,7 @@ I tweeted the CGSpace repository link
January, 2020
@@ -185,7 +185,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
-- If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…
+- If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it’s stored incorrectly in the database…
- Other encodings like
windows-1251
and windows-1257
also fail on different characters like “ž” and “é” that are legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc characters, which are simply not representable in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
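- A quick way to confirm whether the file really isn’t valid UTF-8 is to round-trip it through iconv and let it report the first bad byte sequence:
$ iconv -f utf-8 -t utf-8 /tmp/2020-01-08-authors.csv > /dev/null
- If the file were clean UTF-8 this would exit silently, otherwise iconv prints the position of the offending sequence, which tells me where to look with xxd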
@@ -206,8 +206,8 @@ java.net.SocketTimeoutException: Read timed out
- I am not sure how I will fix that shard…
- I discovered a very interesting tool called ftfy that attempts to fix errors in UTF-8
-- I'm curious to start checking input files with this to see what it highlights
-- I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
+- I’m curious to start checking input files with this to see what it highlights
+- I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don’t know what it’s called?) to digraphs (é→é), which vim identifies as:
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
<é> 233, Hex 00e9, Oct 351, Digr e'
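- For what it’s worth, that é→é conversion is just Unicode NFC normalization (combining sequence to precomposed character), which can be reproduced with nothing but the Python standard library, for example (output file name is mine):
$ python3 -c 'import sys, unicodedata; sys.stdout.write(unicodedata.normalize("NFC", sys.stdin.read()))' < /tmp/2020-01-08-authors.csv > /tmp/authors-nfc.csv
- This is essentially the same normalization that csv-metadata-quality applies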
@@ -283,10 +283,10 @@ COPY 35
- I opened a new pull request on the cg-core repository validate and fix the formatting of the HTML files
- Create more issues for OpenRXV:
-- Based on Peter's feedback on the text for labels and tooltips
-- Based on Peter's feedback for the export icon
-- Based on Peter's feedback for the sort options
-- Based on Abenet's feedback that PDF and Word exports are not working
+- Based on Peter’s feedback on the text for labels and tooltips
+- Based on Peter’s feedback for the export icon
+- Based on Peter’s feedback for the sort options
+- Based on Abenet’s feedback that PDF and Word exports are not working
@@ -352,7 +352,7 @@ $ wc -l hung-nguyen-a*handles.txt
56 hung-nguyen-atmire-handles.txt
102 total
-- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
+- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven’t been indexed yet
- I am curious to check tomorrow to see if they are there
@@ -383,7 +383,7 @@ $ wc -l hung-nguyen-a*handles.txt
$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
-- Here I'm also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using
-flatten
like DSpace already does
+- Here I’m also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using
-flatten
like DSpace already does
- I did some tests with a modified version of the above that uses
-flatten
and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):
$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
@@ -391,16 +391,58 @@ $ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
$ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
$ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
-- This emulate's DSpace's method of generating a high-quality image from the PDF and then creating a thumbnail
-- I put together a proof of concept of this by adding the extra options to dspace-api's
ImageMagickThumbnailFilter.java
and it works
+- This emulates DSpace’s method of generating a high-quality image from the PDF and then creating a thumbnail
+- I put together a proof of concept of this by adding the extra options to dspace-api’s
ImageMagickThumbnailFilter.java
and it works
- I need to run tests on a handful of PDFs to see if there are any side effects
-- The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org's 400KiB PNG!
+- The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org’s 400KiB PNG!
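- A rough sketch of batch-testing this on a directory of sample PDFs (the file naming is mine), generating the new-style and default-style thumbnails side by side for comparison:
$ for pdf in *.pdf; do convert -density 288 -filter lagrange -resize 25% -flatten "${pdf}[0]" "${pdf%.pdf}-new.jpg"; convert -flatten "${pdf}[0]" "${pdf%.pdf}-old.jpg"; done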
- Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:
$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
-
+2020-01-26
+
+- Add “Gender” to controlled vocabulary for CRPs (#442)
+- Deploy the changes on CGSpace and run all updates on the server and reboot it
+
+- I had to restart the
tomcat7
service several times until all Solr statistics cores came up OK
+
+
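+- One way to confirm that all of the statistics cores actually loaded (a sketch, assuming Solr is reachable locally on port 8081) is to ask the cores admin API and list the core names with jq:
+$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | jq '.status | keys'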
+- I spent a few hours writing a script (create-thumbnails) to compare the default DSpace thumbnails with the improved parameters above and actually when comparing them at size 600px I don’t really notice much difference, other than the new ones have slightly crisper text
+
+- So that was a waste of time, though I think our 300px thumbnails are a bit small now
+- Another thread on the ImageMagick forum mentions that you need to set the density, then read the image, then set the density again:
+
+
+
+$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
+
+- One thing worth mentioning was this syntax for extracting bits from JSON in bash using
jq
:
+
+$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
+$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink'
+"/bitstreams/172559/retrieve"
+
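+- As a follow-up, jq’s -r flag prints the value without the surrounding quotes, so the link can be fed straight back to curl (the output file name is mine, and I’m assuming retrieveLink is relative to the /rest base):
+$ LINK=$(echo $RESPONSE | jq -r '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink')
+$ curl -s -o 10568-103447.pdf "https://dspacetest.cgiar.org/rest${LINK}"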
2020-01-27
+
+- Bizu has been having problems when she logs into CGSpace, she can’t see the community list on the front page
+
+- This last happened for another user in 2016-11, and it was related to the Tomcat
maxHttpHeaderSize
being too small because the user was in too many groups
+- I see that it is similar, with this message appearing in the DSpace log just after she logs in:
+
+
+
+2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
+org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
+
+- Now this appears to be a Solr limit of some kind (“too many boolean clauses”)
+
+- I changed the
maxBooleanClauses
for all Solr cores on DSpace Test from 1024 to 2048 and then she was able to see her communities…
+- I made a pull request and merged it to the
5_x-prod
branch and will deploy on CGSpace later tonight
+- I am curious if anyone on the dspace-tech mailing list has run into this, so I will try to send a message about this there when I get a chance
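+- For reference, the limit lives in each core’s solrconfig.xml, so the change is basically the same one-liner for every statistics core (paths here are assumptions), followed by a Tomcat restart so the cores reload:
+$ grep maxBooleanClauses ~/dspace/solr/*/conf/solrconfig.xml
+$ sed -i 's|<maxBooleanClauses>1024</maxBooleanClauses>|<maxBooleanClauses>2048</maxBooleanClauses>|' ~/dspace/solr/*/conf/solrconfig.xml
+- As far as I know Lucene treats this as a global setting, so changing it in every core is the safe approach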
+
+
+
+
diff --git a/docs/404.html b/docs/404.html
deleted file mode 100644
index 1ed3a122a..000000000
--- a/docs/404.html
+++ /dev/null
@@ -1,144 +0,0 @@
- CGSpace Notes
- CGSpace Notes
- Documenting day-to-day work on the CGSpace repository.
- Page Not Found
- Page not found. Go back home.
diff --git a/docs/categories/index.html b/docs/categories/index.html
index e3b0c631d..429c32306 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
January, 2020
@@ -132,7 +132,7 @@
December, 2019
@@ -164,7 +164,7 @@
November, 2019
@@ -183,7 +183,7 @@
1277694
- So 4.6 million from XMLUI and another 1.2 million from API requests
-- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+- Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -202,10 +202,10 @@
CGSpace CG Core v2 Migration
@@ -223,12 +223,12 @@
October, 2019
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -241,7 +241,7 @@
September, 2019
@@ -286,14 +286,14 @@
August, 2019
2019-08-03
-- Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
+- Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
2019-08-04
@@ -301,7 +301,7 @@
- Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
-- After rebooting, all statistics cores were loaded… wow, that's lucky.
+- After rebooting, all statistics cores were loaded… wow, that’s lucky.
- Run system updates on DSpace Test (linode19) and reboot it
@@ -318,7 +318,7 @@
July, 2019
@@ -346,7 +346,7 @@
June, 2019
@@ -372,7 +372,7 @@
May, 2019
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index cc54c2dd4..e86902a90 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -14,7 +14,7 @@
-
+
@@ -28,7 +28,7 @@
-
+
@@ -80,7 +80,7 @@
January, 2020
@@ -117,7 +117,7 @@
December, 2019
@@ -149,7 +149,7 @@
November, 2019
@@ -168,7 +168,7 @@
1277694
- So 4.6 million from XMLUI and another 1.2 million from API requests
-- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+- Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -187,10 +187,10 @@
CGSpace CG Core v2 Migration
@@ -208,12 +208,12 @@
October, 2019
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -226,7 +226,7 @@
September, 2019
@@ -271,14 +271,14 @@
August, 2019
2019-08-03
-- Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
+- Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
2019-08-04
@@ -286,7 +286,7 @@
- Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
-- After rebooting, all statistics cores were loaded… wow, that's lucky.
+- After rebooting, all statistics cores were loaded… wow, that’s lucky.
- Run system updates on DSpace Test (linode19) and reboot it
@@ -303,7 +303,7 @@
July, 2019
@@ -331,7 +331,7 @@
June, 2019
@@ -357,7 +357,7 @@
May, 2019
diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml
index 2d5395f5e..c08d14222 100644
--- a/docs/categories/notes/index.xml
+++ b/docs/categories/notes/index.xml
@@ -82,7 +82,7 @@
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
-<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
+<li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -107,7 +107,7 @@
Tue, 01 Oct 2019 13:20:51 +0300
https://alanorth.github.io/cgspace-notes/2019-10/
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
-
@@ -154,7 +154,7 @@
https://alanorth.github.io/cgspace-notes/2019-08/
<h2 id="2019-08-03">2019-08-03</h2>
<ul>
-<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
+<li>Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
</ul>
<h2 id="2019-08-04">2019-08-04</h2>
<ul>
@@ -162,7 +162,7 @@
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>Before updating it I checked Solr and verified that all statistics cores were loaded properly…</li>
-<li>After rebooting, all statistics cores were loaded… wow, that's lucky.</li>
+<li>After rebooting, all statistics cores were loaded… wow, that’s lucky.</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
@@ -269,9 +269,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
https://alanorth.github.io/cgspace-notes/2019-03/
<h2 id="2019-03-01">2019-03-01</h2>
<ul>
-<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
+<li>I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
<li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…</li>
-<li>Looking at the other half of Udana's WLE records from 2018-11
+<li>Looking at the other half of Udana’s WLE records from 2018-11
<ul>
<li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li>
<li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li>
@@ -329,7 +329,7 @@ sys 0m1.979s
<h2 id="2019-01-02">2019-01-02</h2>
<ul>
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
-<li>I don't see anything interesting in the web server logs around that time though:</li>
+<li>I don’t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -390,7 +390,7 @@ sys 0m1.979s
<h2 id="2018-10-01">2018-10-01</h2>
<ul>
<li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li>
-<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li>
+<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I’m super busy in Nairobi right now</li>
</ul>
@@ -403,9 +403,9 @@ sys 0m1.979s
<h2 id="2018-09-02">2018-09-02</h2>
<ul>
<li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li>
-<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
-<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li>
-<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li>
+<li>I’ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
+<li>Also, I’ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
+<li>I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
@@ -424,10 +424,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li>
-<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li>
-<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError…</li>
+<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat’s</li>
+<li>I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…</li>
<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li>
-<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li>
+<li>The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
</ul>
@@ -460,7 +460,7 @@ sys 0m1.979s
<ul>
<li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>)
<ul>
-<li>There seems to be a problem with the CUA and L&R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li>
+<li>There seems to be a problem with the CUA and L&R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn’t build</li>
</ul>
</li>
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
@@ -506,7 +506,7 @@ sys 2m7.289s
https://alanorth.github.io/cgspace-notes/2018-04/
<h2 id="2018-04-01">2018-04-01</h2>
<ul>
-<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li>
+<li>I tried to test something on DSpace Test but noticed that it’s down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
@@ -532,9 +532,9 @@ sys 2m7.289s
<h2 id="2018-02-01">2018-02-01</h2>
<ul>
<li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li>
-<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li>
+<li>We don’t need to distinguish between internal and external works, so that makes it just a simple list</li>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
-<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li>
+<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu’s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li>
</ul>
@@ -547,7 +547,7 @@ sys 2m7.289s
<h2 id="2018-01-02">2018-01-02</h2>
<ul>
<li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li>
-<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li>
+<li>I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary</li>
<li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li>
<li>In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”</li>
<li>And just before that I see this:</li>
@@ -555,8 +555,8 @@ sys 2m7.289s
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
-<li>I need to increase that, let's try to bump it up from 50 to 75</li>
-<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li>
+<li>I need to increase that, let’s try to bump it up from 50 to 75</li>
+<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -609,7 +609,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
</code></pre><ul>
-<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li>
+<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains</li>
</ul>
@@ -664,7 +664,7 @@ COPY 54701
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
-<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
+<li>There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
</ul>
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 2d6dc71da..89c82dfbd 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -14,7 +14,7 @@
-
+
@@ -28,7 +28,7 @@
-
+
@@ -80,7 +80,7 @@
April, 2019
@@ -121,16 +121,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
March, 2019
2019-03-01
-- I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
+- I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
- I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
-- Looking at the other half of Udana's WLE records from 2018-11
+- Looking at the other half of Udana’s WLE records from 2018-11
- I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
- I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
@@ -153,7 +153,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
February, 2019
@@ -198,7 +198,7 @@ sys 0m1.979s
January, 2019
@@ -206,7 +206,7 @@ sys 0m1.979s
2019-01-02
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
-- I don't see anything interesting in the web server logs around that time though:
+- I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -232,7 +232,7 @@ sys 0m1.979s
December, 2018
@@ -259,7 +259,7 @@ sys 0m1.979s
November, 2018
@@ -286,7 +286,7 @@ sys 0m1.979s
October, 2018
@@ -294,7 +294,7 @@ sys 0m1.979s
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
-- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
+- I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
Read more →
@@ -308,7 +308,7 @@ sys 0m1.979s
September, 2018
@@ -316,9 +316,9 @@ sys 0m1.979s
2018-09-02
- New PostgreSQL JDBC driver version 42.2.5
-- I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
-- Also, I'll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
-- I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
+- I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
+- Also, I’ll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
+- I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
Read more →
@@ -332,7 +332,7 @@ sys 0m1.979s
August, 2018
@@ -346,10 +346,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat's
-- I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat’s
+- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
Read more →
@@ -364,7 +364,7 @@ sys 0m1.979s
July, 2018
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 8c555efd9..15e790af3 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -14,7 +14,7 @@
-
+
@@ -28,7 +28,7 @@
-
+
@@ -80,7 +80,7 @@
June, 2018
@@ -89,7 +89,7 @@
- Test the DSpace 5.8 module upgrades from Atmire (#378)
-- There seems to be a problem with the CUA and L&R versions in
pom.xml
because they are using SNAPSHOT and it doesn't build
+- There seems to be a problem with the CUA and L&R versions in
pom.xml
because they are using SNAPSHOT and it doesn’t build
- I added the new CCAFS Phase II Project Tag
PII-FP1_PACCA2
and merged it into the 5_x-prod
branch (#379)
@@ -118,7 +118,7 @@ sys 2m7.289s
May, 2018
@@ -146,14 +146,14 @@ sys 2m7.289s
April, 2018
2018-04-01
-- I tried to test something on DSpace Test but noticed that it's down since god knows when
+- I tried to test something on DSpace Test but noticed that it’s down since god knows when
- Catalina logs at least show some memory errors yesterday:
Read more →
@@ -168,7 +168,7 @@ sys 2m7.289s
March, 2018
@@ -189,7 +189,7 @@ sys 2m7.289s
February, 2018
@@ -197,9 +197,9 @@ sys 2m7.289s
2018-02-01
- Peter gave feedback on the
dc.rights
proof of concept that I had sent him last week
-- We don't need to distinguish between internal and external works, so that makes it just a simple list
+- We don’t need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
-- I copied the logic in the
jmx_tomcat_dbpools
provided by Ubuntu's munin-plugins-java
package and used the stuff I discovered about JMX in 2018-01
+- I copied the logic in the
jmx_tomcat_dbpools
provided by Ubuntu’s munin-plugins-java
package and used the stuff I discovered about JMX in 2018-01
Read more →
@@ -213,7 +213,7 @@ sys 2m7.289s
January, 2018
@@ -221,7 +221,7 @@ sys 2m7.289s
2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
-- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
+- I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
- The nginx logs show HTTP 200s until
02/Jan/2018:11:27:17 +0000
when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
- And just before that I see this:
@@ -229,8 +229,8 @@ sys 2m7.289s
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
- Ah hah! So the pool was actually empty!
-- I need to increase that, let's try to bump it up from 50 to 75
-- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
+- I need to increase that, let’s try to bump it up from 50 to 75
+- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
- I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -283,7 +283,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
-- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
+- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
Read more →
@@ -297,7 +297,7 @@ dspace.log.2018-01-02:34
December, 2017
@@ -321,7 +321,7 @@ dspace.log.2018-01-02:34
November, 2017
@@ -354,7 +354,7 @@ COPY 54701
October, 2017
@@ -365,7 +365,7 @@ COPY 54701
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-- There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+- There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
- Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
Read more →
@@ -380,10 +380,10 @@ COPY 54701
CGIAR Library Migration
diff --git a/docs/categories/page/2/index.html b/docs/categories/page/2/index.html
index 9df738b2d..0b37b00a7 100644
--- a/docs/categories/page/2/index.html
+++ b/docs/categories/page/2/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
April, 2019
@@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
March, 2019
2019-03-01
-- I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
+- I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
- I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
-- Looking at the other half of Udana's WLE records from 2018-11
+- Looking at the other half of Udana’s WLE records from 2018-11
- I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
- I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
@@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
February, 2019
@@ -213,7 +213,7 @@ sys 0m1.979s
January, 2019
@@ -221,7 +221,7 @@ sys 0m1.979s
2019-01-02
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
-- I don't see anything interesting in the web server logs around that time though:
+- I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -247,7 +247,7 @@ sys 0m1.979s
December, 2018
@@ -274,7 +274,7 @@ sys 0m1.979s
November, 2018
@@ -301,7 +301,7 @@ sys 0m1.979s
October, 2018
@@ -309,7 +309,7 @@ sys 0m1.979s
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
-- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
+- I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
Read more →
@@ -323,7 +323,7 @@ sys 0m1.979s
September, 2018
@@ -331,9 +331,9 @@ sys 0m1.979s
2018-09-02
- New PostgreSQL JDBC driver version 42.2.5
-- I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
-- Also, I'll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
-- I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
+- I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
+- Also, I’ll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
+- I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
Read more →
@@ -347,7 +347,7 @@ sys 0m1.979s
August, 2018
@@ -361,10 +361,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat's
-- I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat’s
+- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
Read more →
@@ -379,7 +379,7 @@ sys 0m1.979s
July, 2018
diff --git a/docs/categories/page/3/index.html b/docs/categories/page/3/index.html
index 00939343d..99f31be73 100644
--- a/docs/categories/page/3/index.html
+++ b/docs/categories/page/3/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
June, 2018
@@ -104,7 +104,7 @@
- Test the DSpace 5.8 module upgrades from Atmire (#378)
-- There seems to be a problem with the CUA and L&R versions in
pom.xml
because they are using SNAPSHOT and it doesn't build
+- There seems to be a problem with the CUA and L&R versions in
pom.xml
because they are using SNAPSHOT and it doesn’t build
- I added the new CCAFS Phase II Project Tag
PII-FP1_PACCA2
and merged it into the 5_x-prod
branch (#379)
@@ -133,7 +133,7 @@ sys 2m7.289s
May, 2018
@@ -161,14 +161,14 @@ sys 2m7.289s
April, 2018
2018-04-01
-- I tried to test something on DSpace Test but noticed that it's down since god knows when
+- I tried to test something on DSpace Test but noticed that it’s down since god knows when
- Catalina logs at least show some memory errors yesterday:
Read more →
@@ -183,7 +183,7 @@ sys 2m7.289s
March, 2018
@@ -204,7 +204,7 @@ sys 2m7.289s
February, 2018
@@ -212,9 +212,9 @@ sys 2m7.289s
2018-02-01
- Peter gave feedback on the
dc.rights
proof of concept that I had sent him last week
-- We don't need to distinguish between internal and external works, so that makes it just a simple list
+- We don’t need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
-- I copied the logic in the
jmx_tomcat_dbpools
provided by Ubuntu's munin-plugins-java
package and used the stuff I discovered about JMX in 2018-01
+- I copied the logic in the
jmx_tomcat_dbpools
provided by Ubuntu’s munin-plugins-java
package and used the stuff I discovered about JMX in 2018-01
Read more →
@@ -228,7 +228,7 @@ sys 2m7.289s
January, 2018
@@ -236,7 +236,7 @@ sys 2m7.289s
2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
-- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
+- I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
- The nginx logs show HTTP 200s until
02/Jan/2018:11:27:17 +0000
when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
- And just before that I see this:
@@ -244,8 +244,8 @@ sys 2m7.289s
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
- Ah hah! So the pool was actually empty!
-- I need to increase that, let's try to bump it up from 50 to 75
-- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
+- I need to increase that, let’s try to bump it up from 50 to 75
+- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
- I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
-- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
+- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
Read more →
@@ -312,7 +312,7 @@ dspace.log.2018-01-02:34
December, 2017
@@ -336,7 +336,7 @@ dspace.log.2018-01-02:34
November, 2017
@@ -369,7 +369,7 @@ COPY 54701
October, 2017
@@ -380,7 +380,7 @@ COPY 54701
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-- There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+- There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
- Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
Read more →
@@ -395,10 +395,10 @@ COPY 54701
CGIAR Library Migration
diff --git a/docs/categories/page/4/index.html b/docs/categories/page/4/index.html
index 25e851434..aea4a4d1a 100644
--- a/docs/categories/page/4/index.html
+++ b/docs/categories/page/4/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@
September, 2017
@@ -106,7 +106,7 @@
2017-09-07
-- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
Read more →
@@ -121,7 +121,7 @@
August, 2017
@@ -139,7 +139,7 @@
- The
robots.txt
only blocks the top-level /discover
and /browse
URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-- It turns out that we're already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+- It turns out that we’re already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
@@ -160,7 +160,7 @@
July, 2017
@@ -171,8 +171,8 @@
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
-- Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-- We can use PostgreSQL's extended output format (
-x
) plus sed
to format the output into quasi XML:
+- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+- We can use PostgreSQL’s extended output format (
-x
) plus sed
to format the output into quasi XML:
Read more →
@@ -187,11 +187,11 @@
June, 2017
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
@@ -205,11 +205,11 @@
May, 2017
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
Read more →
@@ -223,7 +223,7 @@
April, 2017
@@ -252,7 +252,7 @@
March, 2017
@@ -270,7 +270,7 @@
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the
filter-media
bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagic
filter-media
plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-- Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+- Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -288,7 +288,7 @@
February, 2017
@@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
- Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-- Looks like we'll be using
cg.identifier.ccafsprojectpii
as the field name
+- Looks like we’ll be using
cg.identifier.ccafsprojectpii
as the field name
Read more →
@@ -322,15 +322,15 @@ DELETE 1
January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
-- I tested on DSpace Test as well and it doesn't work there either
-- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
+- I tested on DSpace Test as well and it doesn’t work there either
+- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
Read more →
@@ -345,7 +345,7 @@ DELETE 1
December, 2016
@@ -360,8 +360,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-- I've raised a ticket with Atmire to ask
+- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+- I’ve raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
Read more →
diff --git a/docs/categories/page/5/index.html b/docs/categories/page/5/index.html
index 3d7f69049..d411321b3 100644
--- a/docs/categories/page/5/index.html
+++ b/docs/categories/page/5/index.html
@@ -96,13 +96,13 @@
November, 2016
2016-11-01
-- Add dc.type to the output options for Atmire's Listings and Reports module (#286)
+- Add dc.type to the output options for Atmire’s Listings and Reports module (#286)

Read more →
@@ -118,7 +118,7 @@
October, 2016
@@ -131,7 +131,7 @@
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
@@ -148,14 +148,14 @@
September, 2016
2016-09-01
- Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
-- Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
+- Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
- We had been using DC=ILRI to determine whether a user was ILRI or not
- It looks like we might be able to use OUs now, instead of DCs:
@@ -174,7 +174,7 @@
August, 2016
@@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5
July, 2016
@@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
June, 2016
2016-06-01
- Experimenting with IFPRI OAI (we want to harvest their publications)
-- After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
+- After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
- After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
- This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
- You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
@@ -261,7 +261,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
May, 2016
@@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
April, 2016
@@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
- We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
-- After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!
-- This will save us a few gigs of backup space we're paying for on S3
+- After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!
+- This will save us a few gigs of backup space we’re paying for on S3
- Also, I noticed the checker log has some errors we should pay attention to:
Read more →
@@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
March, 2016
2016-03-02
- Looking at issues with author authorities on CGSpace
-- For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
+- For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
Read more →
@@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
February, 2016
diff --git a/docs/categories/page/6/index.html b/docs/categories/page/6/index.html
index 9982f5817..0ee3d3153 100644
--- a/docs/categories/page/6/index.html
+++ b/docs/categories/page/6/index.html
@@ -96,7 +96,7 @@
January, 2016
@@ -119,7 +119,7 @@
December, 2015
@@ -146,7 +146,7 @@
November, 2015
diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html
index 3b461491c..6edb9b537 100644
--- a/docs/cgiar-library-migration/index.html
+++ b/docs/cgiar-library-migration/index.html
@@ -93,10 +93,10 @@
CGIAR Library Migration
@@ -122,8 +122,8 @@
SELECT * FROM pg_stat_activity;
seems to show ~6 extra connections used by the command line tools during import
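A quick way to watch this from the shell during the import is to just count the connections rather than list them; a minimal sketch, assuming psql can reach the dspace database as the postgres user:
$ psql -U postgres -d dspace -c 'SELECT count(*) FROM pg_stat_activity;'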
-Temporarily disable nightly index-discovery cron job because the import process will be taking place during some of this time and I don't want them to be competing to update the Solr index
-Copy HTTPS certificate key pair from CGIAR Library server's Tomcat keystore:
+Temporarily disable nightly index-discovery cron job because the import process will be taking place during some of this time and I don’t want them to be competing to update the Solr index
+Copy HTTPS certificate key pair from CGIAR Library server’s Tomcat keystore:
$ keytool -list -keystore tomcat.keystore
$ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
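nginx will want the certificate and key as separate PEM files rather than the PKCS12 bundle, so a follow-up extraction step is implied; a sketch, assuming openssl is available (the output filenames are illustrative):
$ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pem
$ openssl pkcs12 -in library.cgiar.org.p12 -nocerts -nodes -out library.cgiar.org.key.pem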
@@ -172,7 +172,7 @@ $ for item in 10947-2527/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aor
$ dspace packager -s -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83389 10947-1/10947-1.zip
$ for collection in 10947-1/COLLECTION@10947-*; do dspace packager -s -o ignoreHandle=false -t AIP -e aorth@mjanja.ch -p 10947/1 $collection; done
$ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
-This submits AIP hierarchies recursively (-r) and suppresses errors when an item's parent collection hasn't been created yet—for example, if the item is mapped. The large historic archive (10947/1) is created in several steps because it requires a lot of memory and often crashes.
+This submits AIP hierarchies recursively (-r) and suppresses errors when an item’s parent collection hasn’t been created yet—for example, if the item is mapped. The large historic archive (10947/1) is created in several steps because it requires a lot of memory and often crashes.
Create new subcommunities and collections for content we reorganized into new hierarchies from the original:
- Create CGIAR System Management Board sub-community: 10568/83536
@@ -205,11 +205,11 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
$ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
Post Migration
-- Shut down Tomcat and run update-sequences.sql as the system's postgres user
+- Shut down Tomcat and run update-sequences.sql as the system’s postgres user
- Remove ingestion overrides from dspace.cfg
- Reset PostgreSQL max_connections to 183
- Enable nightly index-discovery cron job
-- Adjust CGSpace's handle-server/config.dct to add the new prefix alongside our existing 10568, ie:
+- Adjust CGSpace’s handle-server/config.dct to add the new prefix alongside our existing 10568, ie:
"server_admins" = (
"300:0.NA/10568"
@@ -225,7 +225,7 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
"300:0.NA/10568"
"300:0.NA/10947"
)
-I had been regenerated the sitebndl.zip file on the CGIAR Library server and sent it to the Handle.net admins but they said that there were mismatches between the public and private keys, which I suspect is due to make-handle-config not being very flexible. After discussing our scenario with the Handle.net admins they said we actually don't need to send an updated sitebndl.zip for this type of change, and the above config.dct edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours…
+I had been regenerated the sitebndl.zip file on the CGIAR Library server and sent it to the Handle.net admins but they said that there were mismatches between the public and private keys, which I suspect is due to make-handle-config not being very flexible. After discussing our scenario with the Handle.net admins they said we actually don’t need to send an updated sitebndl.zip for this type of change, and the above config.dct edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours…
- Update DNS records:
@@ -235,7 +235,7 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
- Re-deploy DSpace from freshly built 5_x-prod branch
- Merge cgiar-library branch to master and re-run ansible nginx templates
- Run system updates and reboot server
-- Switch to Let's Encrypt HTTPS certificates (after DNS is updated and server isn't busy):
+- Switch to Let’s Encrypt HTTPS certificates (after DNS is updated and server isn’t busy):
$ sudo systemctl stop nginx
$ /opt/certbot-auto certonly --standalone -d library.cgiar.org
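Because the standalone authenticator needs the web ports, renewals presumably need the same stop/start dance around nginx; a hedged sketch of a renewal command (the hooks are an assumption, not taken from the notes):
$ /opt/certbot-auto renew --pre-hook 'systemctl stop nginx' --post-hook 'systemctl start nginx'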
@@ -251,7 +251,7 @@ $ sudo systemctl start nginx
After a few rounds of ingesting—possibly with failures—you might end up with inconsistent IDs in the database. In this case, during AIP ingest of a single collection in submit mode (-s):
org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
Detail: Key (handle_id)=(86227) already exists.
-The normal solution is to run the update-sequences.sql script (with Tomcat shut down) but it doesn't seem to work in this case. Finding the maximum handle_id and manually updating the sequence seems to work:
+The normal solution is to run the update-sequences.sql script (with Tomcat shut down) but it doesn’t seem to work in this case. Finding the maximum handle_id and manually updating the sequence seems to work:
dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
dspace=# select setval('handle_seq',86873);
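The two steps can also be collapsed into one statement so there is no hard-coded value to copy by hand; same assumption that handle_seq is the sequence in question:
dspace=# select setval('handle_seq', (select max(handle_id) from handle));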
diff --git a/docs/cgspace-cgcorev2-migration/index.html b/docs/cgspace-cgcorev2-migration/index.html
index 8164adbc7..1380969d6 100644
--- a/docs/cgspace-cgcorev2-migration/index.html
+++ b/docs/cgspace-cgcorev2-migration/index.html
@@ -93,10 +93,10 @@
CGSpace CG Core v2 Migration
@@ -424,7 +424,7 @@
- There is potentially a lot of work in the OAI metadata formats like DIM, METS, and QDC (see dspace/config/crosswalks/oai/*.xsl)
-¹ Not committed yet because I don't want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:
+¹ Not committed yet because I don’t want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:
$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
diff --git a/docs/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css b/docs/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css
similarity index 97%
rename from docs/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css
rename to docs/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css
index b905f91ce..19a9d118e 100644
--- a/docs/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css
+++ b/docs/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css
@@ -1,7 +1,13 @@
-@charset "UTF-8";/*!
- * Font Awesome 4.7.0 by @davegandy - http://fontawesome.io - @fontawesome
- * License - http://fontawesome.io/license (Font: SIL OFL 1.1, CSS: MIT License)
- */@font-face{font-family:FontAwesome;src:url(../fonts/fontawesome-webfont.eot?v=4.7.0);src:url(../fonts/fontawesome-webfont.eot?#iefix&v=4.7.0) format("embedded-opentype"),url(../fonts/fontawesome-webfont.woff2?v=4.7.0) format("woff2"),url(../fonts/fontawesome-webfont.woff?v=4.7.0) format("woff"),url(../fonts/fontawesome-webfont.ttf?v=4.7.0) format("truetype"),url(../fonts/fontawesome-webfont.svg?v=4.7.0#fontawesomeregular) format("svg");font-weight:400;font-style:normal}.fa{display:inline-block;font:normal normal normal 14px/1 FontAwesome;font-size:inherit;text-rendering:auto;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.fa-lg{font-size:1.333333em;line-height:.75em;vertical-align:-15%}.fa-2x{font-size:2em}.fa-3x{font-size:3em}.fa-4x{font-size:4em}.fa-5x{font-size:5em}.fa-fw{width:1.285714em;text-align:center}.fa-tag:before{content:""}.fa-folder:before{content:""}.fa-facebook:before{content:""}.fa-google-plus:before{content:""}.fa-linkedin:before{content:""}.fa-rss:before{content:""}.fa-rss-square:before{content:""}.fa-twitter:before{content:""}.fa-hacker-news:before,.fa-y-combinator-square:before,.fa-yc-square:before{content:""}.fa-reddit:before{content:""}.fa-reddit-square:before{content:""}.fa-stumbleupon-circle:before{content:""}.fa-stumbleupon:before{content:""}.sr-only{position:absolute;width:1px;height:1px;padding:0;margin:-1px;overflow:hidden;clip:rect(0,0,0,0);border:0}.sr-only-focusable:active,.sr-only-focusable:focus{position:static;width:auto;height:auto;margin:0;overflow:visible;clip:auto}/*!
+/*!
+ * Font Awesome Free 5.12.0 by @fontawesome - https://fontawesome.com
+ * License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License)
+ */.fa,.fab,.fad,.fal,.far,.fas{-moz-osx-font-smoothing:grayscale;-webkit-font-smoothing:antialiased;display:inline-block;font-style:normal;font-variant:normal;text-rendering:auto;line-height:1}.fa-lg{font-size:1.333333em;line-height:.75em;vertical-align:-.0667em}.fa-xs{font-size:.75em}.fa-sm{font-size:.875em}.fa-1x{font-size:1em}.fa-2x{font-size:2em}.fa-3x{font-size:3em}.fa-4x{font-size:4em}.fa-5x{font-size:5em}.fa-6x{font-size:6em}.fa-7x{font-size:7em}.fa-8x{font-size:8em}.fa-9x{font-size:9em}.fa-10x{font-size:10em}.fa-tag:before{content:"\f02b"}.fa-folder:before{content:"\f07b"}.fa-facebook:before{content:"\f09a"}.fa-facebook-f:before{content:"\f39e"}.fa-linkedin:before{content:"\f08c"}.fa-linkedin-in:before{content:"\f0e1"}.fa-rss:before{content:"\f09e"}.fa-rss-square:before{content:"\f143"}.fa-twitch:before{content:"\f1e8"}.fa-twitter:before{content:"\f099"}.fa-hacker-news:before,.fa-y-combinator-square:before,.fa-yc-square:before{content:"\f1d4"}.fa-reddit:before{content:"\f1a1"}.fa-reddit-square:before{content:"\f1a2"}.fa-stumbleupon-circle:before{content:"\f1a3"}.fa-stumbleupon:before{content:"\f1a4"}/*!
+ * Font Awesome Free 5.12.0 by @fontawesome - https://fontawesome.com
+ * License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License)
+ */@font-face{font-family:'Font Awesome 5 Free';font-style:normal;font-weight:900;font-display:auto;src:url(../webfonts/fa-solid-900.eot);src:url(../webfonts/fa-solid-900.eot?#iefix) format("embedded-opentype"),url(../webfonts/fa-solid-900.woff2) format("woff2"),url(../webfonts/fa-solid-900.woff) format("woff"),url(../webfonts/fa-solid-900.ttf) format("truetype"),url(../webfonts/fa-solid-900.svg#fontawesome) format("svg")}.fa,.fas{font-family:'Font Awesome 5 Free';font-weight:900}/*!
+ * Font Awesome Free 5.12.0 by @fontawesome - https://fontawesome.com
+ * License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License)
+ */@font-face{font-family:'Font Awesome 5 Brands';font-style:normal;font-weight:400;font-display:auto;src:url(../webfonts/fa-brands-400.eot);src:url(../webfonts/fa-brands-400.eot?#iefix) format("embedded-opentype"),url(../webfonts/fa-brands-400.woff2) format("woff2"),url(../webfonts/fa-brands-400.woff) format("woff"),url(../webfonts/fa-brands-400.ttf) format("truetype"),url(../webfonts/fa-brands-400.svg#fontawesome) format("svg")}.fab{font-family:'Font Awesome 5 Brands'}/*!
* Bootstrap v4.4.1 (https://getbootstrap.com/)
* Copyright 2011-2019 The Bootstrap Authors
* Copyright 2011-2019 Twitter, Inc.
diff --git a/docs/fonts/FontAwesome.otf b/docs/fonts/FontAwesome.otf
deleted file mode 100644
index 401ec0f36..000000000
Binary files a/docs/fonts/FontAwesome.otf and /dev/null differ
diff --git a/docs/fonts/fontawesome-webfont.eot b/docs/fonts/fontawesome-webfont.eot
deleted file mode 100644
index e9f60ca95..000000000
Binary files a/docs/fonts/fontawesome-webfont.eot and /dev/null differ
diff --git a/docs/fonts/fontawesome-webfont.svg b/docs/fonts/fontawesome-webfont.svg
deleted file mode 100644
index 855c845e5..000000000
--- a/docs/fonts/fontawesome-webfont.svg
+++ /dev/null
@@ -1,2671 +0,0 @@
diff --git a/docs/fonts/fontawesome-webfont.ttf b/docs/fonts/fontawesome-webfont.ttf
deleted file mode 100644
index 35acda2fa..000000000
Binary files a/docs/fonts/fontawesome-webfont.ttf and /dev/null differ
diff --git a/docs/fonts/fontawesome-webfont.woff b/docs/fonts/fontawesome-webfont.woff
deleted file mode 100644
index 400014a4b..000000000
Binary files a/docs/fonts/fontawesome-webfont.woff and /dev/null differ
diff --git a/docs/fonts/fontawesome-webfont.woff2 b/docs/fonts/fontawesome-webfont.woff2
deleted file mode 100644
index 4d13fc604..000000000
Binary files a/docs/fonts/fontawesome-webfont.woff2 and /dev/null differ
diff --git a/docs/index.html b/docs/index.html
index c07b7d881..2af3f59ec 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -95,7 +95,7 @@
January, 2020
@@ -132,7 +132,7 @@
December, 2019
@@ -164,7 +164,7 @@
November, 2019
@@ -183,7 +183,7 @@
1277694
- So 4.6 million from XMLUI and another 1.2 million from API requests
-- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+- Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
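The bitstream-only count would then just be a second filter on the request path; a sketch (the /rest/bitstreams path filter is an assumption based on the DSpace REST API URL layout):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"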
@@ -202,10 +202,10 @@
CGSpace CG Core v2 Migration
@@ -223,12 +223,12 @@
October, 2019
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -241,7 +241,7 @@
September, 2019
@@ -286,14 +286,14 @@
August, 2019
2019-08-03
-- Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
+- Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
2019-08-04
@@ -301,7 +301,7 @@
- Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
-- After rebooting, all statistics cores were loaded… wow, that's lucky.
+- After rebooting, all statistics cores were loaded… wow, that’s lucky.
- Run system updates on DSpace Test (linode19) and reboot it
@@ -318,7 +318,7 @@
July, 2019
@@ -346,7 +346,7 @@
June, 2019
@@ -372,7 +372,7 @@
May, 2019
diff --git a/docs/index.xml b/docs/index.xml
index e81b9f2f3..38f4331bc 100644
--- a/docs/index.xml
+++ b/docs/index.xml
@@ -82,7 +82,7 @@
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
-<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
+<li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -107,7 +107,7 @@
Tue, 01 Oct 2019 13:20:51 +0300
https://alanorth.github.io/cgspace-notes/2019-10/
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
@@ -154,7 +154,7 @@
https://alanorth.github.io/cgspace-notes/2019-08/
<h2 id="2019-08-03">2019-08-03</h2>
<ul>
-<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
+<li>Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
</ul>
<h2 id="2019-08-04">2019-08-04</h2>
<ul>
@@ -162,7 +162,7 @@
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>Before updating it I checked Solr and verified that all statistics cores were loaded properly…</li>
-<li>After rebooting, all statistics cores were loaded… wow, that's lucky.</li>
+<li>After rebooting, all statistics cores were loaded… wow, that’s lucky.</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
@@ -269,9 +269,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
https://alanorth.github.io/cgspace-notes/2019-03/
<h2 id="2019-03-01">2019-03-01</h2>
<ul>
-<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
+<li>I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
<li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…</li>
-<li>Looking at the other half of Udana's WLE records from 2018-11
+<li>Looking at the other half of Udana’s WLE records from 2018-11
<ul>
<li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li>
<li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li>
@@ -329,7 +329,7 @@ sys 0m1.979s
<h2 id="2019-01-02">2019-01-02</h2>
<ul>
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
-<li>I don't see anything interesting in the web server logs around that time though:</li>
+<li>I don’t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -390,7 +390,7 @@ sys 0m1.979s
<h2 id="2018-10-01">2018-10-01</h2>
<ul>
<li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li>
-<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li>
+<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I’m super busy in Nairobi right now</li>
</ul>
@@ -403,9 +403,9 @@ sys 0m1.979s
<h2 id="2018-09-02">2018-09-02</h2>
<ul>
<li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li>
-<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
-<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li>
-<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li>
+<li>I’ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
+<li>Also, I’ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
+<li>I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
@@ -424,10 +424,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li>
-<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li>
-<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError…</li>
+<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat’s</li>
+<li>I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…</li>
<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li>
-<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li>
+<li>The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
</ul>
@@ -460,7 +460,7 @@ sys 0m1.979s
<ul>
<li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>)
<ul>
-<li>There seems to be a problem with the CUA and L&R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li>
+<li>There seems to be a problem with the CUA and L&R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn’t build</li>
</ul>
</li>
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
@@ -506,7 +506,7 @@ sys 2m7.289s
https://alanorth.github.io/cgspace-notes/2018-04/
<h2 id="2018-04-01">2018-04-01</h2>
<ul>
-<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li>
+<li>I tried to test something on DSpace Test but noticed that it’s down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
@@ -532,9 +532,9 @@ sys 2m7.289s
<h2 id="2018-02-01">2018-02-01</h2>
<ul>
<li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li>
-<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li>
+<li>We don’t need to distinguish between internal and external works, so that makes it just a simple list</li>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
-<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li>
+<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu’s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li>
</ul>
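For reference, remote JMX on Tomcat is typically enabled with JVM flags along these lines before a munin plugin can poll it; a minimal sketch, where the port and the disabled auth/SSL settings are assumptions rather than the actual CGSpace configuration:
<pre><code>JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=5400 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false"
</code></pre>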
@@ -547,7 +547,7 @@ sys 2m7.289s
<h2 id="2018-01-02">2018-01-02</h2>
<ul>
<li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li>
-<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li>
+<li>I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary</li>
<li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li>
<li>In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”</li>
<li>And just before that I see this:</li>
@@ -555,8 +555,8 @@ sys 2m7.289s
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
-<li>I need to increase that, let's try to bump it up from 50 to 75</li>
-<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li>
+<li>I need to increase that, let’s try to bump it up from 50 to 75</li>
+<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -609,7 +609,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
</code></pre><ul>
-<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li>
+<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains</li>
</ul>
@@ -664,7 +664,7 @@ COPY 54701
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
-<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
+<li>There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
</ul>
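A first pass at finding those double-handle values could be done in SQL; a sketch only, filtering on the value pattern because the metadata_field_id for that field is not given in the notes:
<pre><code>dspace=# SELECT resource_id, text_value FROM metadatavalue WHERE text_value LIKE '%hdl.handle.net%||%hdl.handle.net%';
</code></pre>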
@@ -690,7 +690,7 @@ COPY 54701
</ul>
<h2 id="2017-09-07">2017-09-07</h2>
<ul>
-<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li>
+<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group</li>
</ul>
@@ -714,7 +714,7 @@ COPY 54701
</li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs… we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
-<li>It turns out that we're already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
+<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
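Regarding the robots.txt and X-Robots-Tag discussion above, a quick way to see exactly what a crawler receives for those URLs is to request one with a bot user agent and inspect the status line and headers; a sketch (the user agent string is just an example):
<pre><code>$ curl -s -o /dev/null -D - -A "Googlebot" https://cgspace.cgiar.org/discover | grep -E "^(HTTP|X-Robots-Tag)"
</code></pre>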
@@ -737,8 +737,8 @@ COPY 54701
<h2 id="2017-07-04">2017-07-04</h2>
<ul>
<li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li>
-<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li>
-<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
+<li>Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace</li>
+<li>We can use PostgreSQL’s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
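The actual command is not shown in this summary, but the idea would be something in this direction; a sketch only, using the standard DSpace metadatafieldregistry columns, with illustrative sed expressions:
<pre><code>$ psql -x -d dspace -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry;' \
    | sed -e 's/^-\[ RECORD.*/<field>/' -e 's/^\([a-z_]*\) *| \(.*\)/  <\1>\2<\/\1>/'
</code></pre>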
@@ -748,7 +748,7 @@ COPY 54701
Thu, 01 Jun 2017 10:14:52 +0300
https://alanorth.github.io/cgspace-notes/2017-06/
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
@@ -757,7 +757,7 @@ COPY 54701
Mon, 01 May 2017 16:21:52 +0200
https://alanorth.github.io/cgspace-notes/2017-05/
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
@@ -800,7 +800,7 @@ COPY 54701
<li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li>
<li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li>
<li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
-<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
+<li>Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -828,7 +828,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
</code></pre><ul>
<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li>
-<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
+<li>Looks like we’ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
</ul>
@@ -841,8 +841,8 @@ DELETE 1
<h2 id="2017-01-02">2017-01-02</h2>
<ul>
<li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li>
-<li>I tested on DSpace Test as well and it doesn't work there either</li>
-<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li>
+<li>I tested on DSpace Test as well and it doesn’t work there either</li>
+<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years</li>
</ul>
@@ -863,8 +863,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
</code></pre><ul>
-<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li>
-<li>I've raised a ticket with Atmire to ask</li>
+<li>I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade</li>
+<li>I’ve raised a ticket with Atmire to ask</li>
<li>Another worrying error from dspace.log is:</li>
</ul>
@@ -877,7 +877,7 @@ DELETE 1
https://alanorth.github.io/cgspace-notes/2016-11/
<h2 id="2016-11-01">2016-11-01</h2>
<ul>
-<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
+<li>Add <code>dc.type</code> to the output options for Atmire’s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
</ul>
<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p>
@@ -897,7 +897,7 @@ DELETE 1
<li>ORCIDs plus normal authors</li>
</ul>
</li>
-<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
+<li>I exported a random item’s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre>
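Putting that together, the test CSV and the batch import would look roughly like this; a sketch where the id and collection values are placeholders, not real values from the notes:
<pre><code>$ cat /tmp/orcid-test.csv
id,collection,ORCID:dc.contributor.author
80044,10568/12345,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
$ dspace metadata-import -f /tmp/orcid-test.csv -e aorth@mjanja.ch
</code></pre>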
@@ -912,7 +912,7 @@ DELETE 1
<h2 id="2016-09-01">2016-09-01</h2>
<ul>
<li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li>
-<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li>
+<li>Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace</li>
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
@@ -972,7 +972,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<h2 id="2016-06-01">2016-06-01</h2>
<ul>
<li>Experimenting with IFPRI OAI (we want to harvest their publications)</li>
-<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
+<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI’s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
<li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li>
<li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc</a></li>
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
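To eyeball the available sets from the command line, something like this works; a sketch, assuming xmllint is installed:
<pre><code>$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListSets' | xmllint --format - | grep -E '<(setSpec|setName)>'
</code></pre>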
@@ -1007,8 +1007,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<ul>
<li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li>
<li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li>
-<li>After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!</li>
-<li>This will save us a few gigs of backup space we're paying for on S3</li>
+<li>After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!</li>
+<li>This will save us a few gigs of backup space we’re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
@@ -1022,7 +1022,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<h2 id="2016-03-02">2016-03-02</h2>
<ul>
<li>Looking at issues with author authorities on CGSpace</li>
-<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li>
+<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module</li>
<li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li>
</ul>
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index dc8db6153..c4c4744c6 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -95,7 +95,7 @@
April, 2019
@@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
March, 2019
2019-03-01
-- I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
+- I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
- I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
-- Looking at the other half of Udana's WLE records from 2018-11
+- Looking at the other half of Udana’s WLE records from 2018-11
- I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
- I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
@@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
February, 2019
@@ -213,7 +213,7 @@ sys 0m1.979s
January, 2019
@@ -221,7 +221,7 @@ sys 0m1.979s
2019-01-02
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
-- I don't see anything interesting in the web server logs around that time though:
+- I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -247,7 +247,7 @@ sys 0m1.979s
December, 2018
@@ -274,7 +274,7 @@ sys 0m1.979s
November, 2018
@@ -301,7 +301,7 @@ sys 0m1.979s
October, 2018
@@ -309,7 +309,7 @@ sys 0m1.979s
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
-- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
+- I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
Read more →
@@ -323,7 +323,7 @@ sys 0m1.979s
September, 2018
@@ -331,9 +331,9 @@ sys 0m1.979s
2018-09-02
- New PostgreSQL JDBC driver version 42.2.5
-- I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
-- Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
-- I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
+- I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
+- Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
+- I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
Read more →
@@ -347,7 +347,7 @@ sys 0m1.979s
August, 2018
@@ -361,10 +361,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-- From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
-- I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+- From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
+- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
Read more →
@@ -379,7 +379,7 @@ sys 0m1.979s
July, 2018
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index bdc888424..4dd65db75 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -95,7 +95,7 @@
June, 2018
@@ -104,7 +104,7 @@
- Test the DSpace 5.8 module upgrades from Atmire (#378)
-- There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
+- There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
- I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
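For context, the local test builds in these notes are normally done with Maven along these lines, which is where the SNAPSHOT resolution fails; a sketch, since the exact flags used at the time are not shown here:
$ mvn -U -Dmirage2.on=true clean package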
@@ -133,7 +133,7 @@ sys 2m7.289s
May, 2018
@@ -161,14 +161,14 @@ sys 2m7.289s
April, 2018
2018-04-01
-- I tried to test something on DSpace Test but noticed that it's down since god knows when
+- I tried to test something on DSpace Test but noticed that it’s down since god knows when
- Catalina logs at least show some memory errors yesterday:
Read more →
@@ -183,7 +183,7 @@ sys 2m7.289s
March, 2018
@@ -204,7 +204,7 @@ sys 2m7.289s
February, 2018
@@ -212,9 +212,9 @@ sys 2m7.289s
2018-02-01
- Peter gave feedback on the dc.rights proof of concept that I had sent him last week
-- We don't need to distinguish between internal and external works, so that makes it just a simple list
+- We don’t need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
-- I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
+- I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
Read more →
@@ -228,7 +228,7 @@ sys 2m7.289s
January, 2018
@@ -236,7 +236,7 @@ sys 2m7.289s
2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
-- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
+- I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
- The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
- And just before that I see this:
@@ -244,8 +244,8 @@ sys 2m7.289s
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
- Ah hah! So the pool was actually empty!
-- I need to increase that, let's try to bump it up from 50 to 75
-- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
+- I need to increase that, let’s try to bump it up from 50 to 75
+- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
- I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
-- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
+- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
Read more →
@@ -312,7 +312,7 @@ dspace.log.2018-01-02:34
December, 2017
@@ -336,7 +336,7 @@ dspace.log.2018-01-02:34
November, 2017
@@ -369,7 +369,7 @@ COPY 54701
October, 2017
@@ -380,7 +380,7 @@ COPY 54701
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-- There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+- There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
- Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
Read more →
@@ -395,10 +395,10 @@ COPY 54701
CGIAR Library Migration
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index cbd4e683d..3ff8aa57a 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -96,7 +96,7 @@
September, 2017
@@ -106,7 +106,7 @@
2017-09-07
-- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
Read more →
@@ -121,7 +121,7 @@
August, 2017
@@ -139,7 +139,7 @@
- The
robots.txt
only blocks the top-level /discover
and /browse
URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-- It turns out that we're already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+- It turns out that we’re already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
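- If we do end up blocking by user agent in nginx, one way to verify the rule is to request /discover with a bot user agent and compare status codes (a sketch; the hostname is just an example):

```
# expect 403 for a known bot UA and 200 for a normal browser UA once the rule is in place
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Googlebot/2.1' 'https://dspacetest.cgiar.org/discover'
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' 'https://dspacetest.cgiar.org/discover'
```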
@@ -160,7 +160,7 @@
July, 2017
@@ -171,8 +171,8 @@
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
-- Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-- We can use PostgreSQL's extended output format (
-x
) plus sed
to format the output into quasi XML:
+- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+- We can use PostgreSQL’s extended output format (
-x
) plus sed
to format the output into quasi XML:
Read more →
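- As a rough illustration of the `-x` plus `sed` trick above (the query and the sed expression are just examples, not the exact ones used):

```
# expanded output prints "field | value" pairs; wrap each pair in <field>value</field>
$ psql -x -U dspace -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry LIMIT 3;' dspace \
  | sed -E 's/^([a-z_0-9]+) +\| ?(.*)$/<\1>\2<\/\1>/'
```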
@@ -187,11 +187,11 @@
June, 2017
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
@@ -205,11 +205,11 @@
May, 2017
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
Read more →
@@ -223,7 +223,7 @@
April, 2017
@@ -252,7 +252,7 @@
March, 2017
@@ -270,7 +270,7 @@
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the
filter-media
bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagic
filter-media
plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-- Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+- Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
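- If needed, a one-off fix for such thumbnails is to convert them back to sRGB with ImageMagick (a sketch on the same example file):

```
# convert the CMYK thumbnail to sRGB and confirm the colorspace afterwards
$ convert ~/Desktop/alc_contrastes_desafios.jpg -colorspace sRGB /tmp/alc_contrastes_desafios-srgb.jpg
$ identify /tmp/alc_contrastes_desafios-srgb.jpg
```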
@@ -288,7 +288,7 @@
February, 2017
@@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
- Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-- Looks like we'll be using
cg.identifier.ccafsprojectpii
as the field name
+- Looks like we’ll be using
cg.identifier.ccafsprojectpii
as the field name
Read more →
@@ -322,15 +322,15 @@ DELETE 1
January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
-- I tested on DSpace Test as well and it doesn't work there either
-- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
+- I tested on DSpace Test as well and it doesn’t work there either
+- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
Read more →
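- For the record, the sharding task in question lives in `dspace stats-util`; if I recall correctly it can be run manually like this (option from memory, so treat it as an assumption):

```
# shard the Solr statistics core into yearly cores (e.g. statistics-2015)
$ dspace stats-util -s
```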
@@ -345,7 +345,7 @@ DELETE 1
December, 2016
@@ -360,8 +360,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-- I've raised a ticket with Atmire to ask
+- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+- I’ve raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
Read more →
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index f24e27931..f619065d8 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,13 +96,13 @@
November, 2016
2016-11-01
-- Add
dc.type
to the output options for Atmire's Listings and Reports module (#286)
+- Add
dc.type
to the output options for Atmire’s Listings and Reports module (#286)

Read more →
@@ -118,7 +118,7 @@
October, 2016
@@ -131,7 +131,7 @@
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author
with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new coloum called ORCID:dc.contributor.author
with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
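- A sketch of what that test CSV and the subsequent import might look like (the item id, collection handle, and eperson address are placeholders):

```
$ cat /tmp/orcid-test.csv
id,collection,ORCID:dc.contributor.author
84590,10568/3030,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
# import the batch edit as an existing eperson (placeholder email)
$ dspace metadata-import -f /tmp/orcid-test.csv -e aorth@example.com
```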
@@ -148,14 +148,14 @@
September, 2016
2016-09-01
- Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
-- Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
+- Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
- We had been using
DC=ILRI
to determine whether a user was ILRI or not
- It looks like we might be able to use OUs now, instead of DCs:
@@ -174,7 +174,7 @@
August, 2016
@@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5
July, 2016
@@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
June, 2016
2016-06-01
- Experimenting with IFPRI OAI (we want to harvest their publications)
-- After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
+- After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
- After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
- This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
- You can see the others by using the OAI
ListSets
verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
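- A quick, hedged example of poking at that endpoint from the command line; ListIdentifiers is a lighter-weight way to estimate what a ListRecords harvest would return (only the first page of results, unless you follow resumptionTokens):

```
# roughly count identifiers in the publications set since 2016-01-01 (first page only)
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=p15738coll2&from=2016-01-01' \
  | grep -o '<identifier>' | wc -l
```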
@@ -261,7 +261,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
May, 2016
@@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
April, 2016
@@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
- We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
-- After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!
-- This will save us a few gigs of backup space we're paying for on S3
+- After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!
+- This will save us a few gigs of backup space we’re paying for on S3
- Also, I noticed the
checker
log has some errors we should pay attention to:
Read more →
@@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
March, 2016
2016-03-02
- Looking at issues with author authorities on CGSpace
-- For some reason we still have the
index-lucene-update
cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
+- For some reason we still have the
index-lucene-update
cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
Read more →
@@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
February, 2016
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 5f92fe671..8aa15d50c 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@
January, 2016
@@ -119,7 +119,7 @@
December, 2015
@@ -146,7 +146,7 @@
November, 2015
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 4e7d38b1f..08b96f158 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
January, 2020
@@ -132,7 +132,7 @@
December, 2019
@@ -164,7 +164,7 @@
November, 2019
@@ -183,7 +183,7 @@
1277694
- So 4.6 million from XMLUI and another 1.2 million from API requests
-- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+- Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -202,10 +202,10 @@
CGSpace CG Core v2 Migration
@@ -223,12 +223,12 @@
October, 2019
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -241,7 +241,7 @@
September, 2019
@@ -286,14 +286,14 @@
August, 2019
2019-08-03
-- Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
+- Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
2019-08-04
@@ -301,7 +301,7 @@
- Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
-- After rebooting, all statistics cores were loaded… wow, that's lucky.
+- After rebooting, all statistics cores were loaded… wow, that’s lucky.
- Run system updates on DSpace Test (linode19) and reboot it
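- A sketch of how one might verify the statistics cores from the command line after such a reboot (the Solr port is an assumption):

```
# list loaded Solr cores and look for the statistics shards
$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json' | grep -o '"name":"statistics[^"]*"'
```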
@@ -318,7 +318,7 @@
July, 2019
@@ -346,7 +346,7 @@
June, 2019
@@ -372,7 +372,7 @@
May, 2019
diff --git a/docs/posts/index.xml b/docs/posts/index.xml
index a5abc1a10..2f06a2baa 100644
--- a/docs/posts/index.xml
+++ b/docs/posts/index.xml
@@ -82,7 +82,7 @@
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
-<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
+<li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -107,7 +107,7 @@
Tue, 01 Oct 2019 13:20:51 +0300
https://alanorth.github.io/cgspace-notes/2019-10/
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
-
@@ -154,7 +154,7 @@
https://alanorth.github.io/cgspace-notes/2019-08/
<h2 id="2019-08-03">2019-08-03</h2>
<ul>
-<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
+<li>Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
</ul>
<h2 id="2019-08-04">2019-08-04</h2>
<ul>
@@ -162,7 +162,7 @@
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>Before updating it I checked Solr and verified that all statistics cores were loaded properly…</li>
-<li>After rebooting, all statistics cores were loaded… wow, that's lucky.</li>
+<li>After rebooting, all statistics cores were loaded… wow, that’s lucky.</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
@@ -269,9 +269,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
https://alanorth.github.io/cgspace-notes/2019-03/
<h2 id="2019-03-01">2019-03-01</h2>
<ul>
-<li>I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
+<li>I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
<li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…</li>
-<li>Looking at the other half of Udana's WLE records from 2018-11
+<li>Looking at the other half of Udana’s WLE records from 2018-11
<ul>
<li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li>
<li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li>
@@ -329,7 +329,7 @@ sys 0m1.979s
<h2 id="2019-01-02">2019-01-02</h2>
<ul>
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
-<li>I don't see anything interesting in the web server logs around that time though:</li>
+<li>I don’t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -390,7 +390,7 @@ sys 0m1.979s
<h2 id="2018-10-01">2018-10-01</h2>
<ul>
<li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li>
-<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I'm super busy in Nairobi right now</li>
+<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I’m super busy in Nairobi right now</li>
</ul>
@@ -403,9 +403,9 @@ sys 0m1.979s
<h2 id="2018-09-02">2018-09-02</h2>
<ul>
<li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li>
-<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
-<li>Also, I'll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month</li>
-<li>I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:</li>
+<li>I’ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
+<li>Also, I’ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
+<li>I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
@@ -424,10 +424,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li>
-<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat's</li>
-<li>I'm not sure why Tomcat didn't crash with an OutOfMemoryError…</li>
+<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat’s</li>
+<li>I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…</li>
<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li>
-<li>The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes</li>
+<li>The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
</ul>
@@ -460,7 +460,7 @@ sys 0m1.979s
<ul>
<li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>)
<ul>
-<li>There seems to be a problem with the CUA and L&R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn't build</li>
+<li>There seems to be a problem with the CUA and L&R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn’t build</li>
</ul>
</li>
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
@@ -506,7 +506,7 @@ sys 2m7.289s
https://alanorth.github.io/cgspace-notes/2018-04/
<h2 id="2018-04-01">2018-04-01</h2>
<ul>
-<li>I tried to test something on DSpace Test but noticed that it's down since god knows when</li>
+<li>I tried to test something on DSpace Test but noticed that it’s down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
@@ -532,9 +532,9 @@ sys 2m7.289s
<h2 id="2018-02-01">2018-02-01</h2>
<ul>
<li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li>
-<li>We don't need to distinguish between internal and external works, so that makes it just a simple list</li>
+<li>We don’t need to distinguish between internal and external works, so that makes it just a simple list</li>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
-<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu's <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li>
+<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu’s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-01/">in 2018-01</a></li>
</ul>
@@ -547,7 +547,7 @@ sys 2m7.289s
<h2 id="2018-01-02">2018-01-02</h2>
<ul>
<li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li>
-<li>I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary</li>
+<li>I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary</li>
<li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li>
<li>In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”</li>
<li>And just before that I see this:</li>
@@ -555,8 +555,8 @@ sys 2m7.289s
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
-<li>I need to increase that, let's try to bump it up from 50 to 75</li>
-<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw</li>
+<li>I need to increase that, let’s try to bump it up from 50 to 75</li>
+<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -609,7 +609,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
</code></pre><ul>
-<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains</li>
+<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains</li>
</ul>
@@ -664,7 +664,7 @@ COPY 54701
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
-<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
+<li>There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
</ul>
@@ -690,7 +690,7 @@ COPY 54701
</ul>
<h2 id="2017-09-07">2017-09-07</h2>
<ul>
-<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li>
+<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group</li>
</ul>
@@ -714,7 +714,7 @@ COPY 54701
</li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs… we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
-<li>It turns out that we're already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
+<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
@@ -737,8 +737,8 @@ COPY 54701
<h2 id="2017-07-04">2017-07-04</h2>
<ul>
<li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li>
-<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li>
-<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
+<li>Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace</li>
+<li>We can use PostgreSQL’s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
@@ -748,7 +748,7 @@ COPY 54701
Thu, 01 Jun 2017 10:14:52 +0300
https://alanorth.github.io/cgspace-notes/2017-06/
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
-
@@ -757,7 +757,7 @@ COPY 54701
Mon, 01 May 2017 16:21:52 +0200
https://alanorth.github.io/cgspace-notes/2017-05/
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
-
@@ -800,7 +800,7 @@ COPY 54701
<li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li>
<li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li>
<li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
-<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
+<li>Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -828,7 +828,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
</code></pre><ul>
<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li>
-<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
+<li>Looks like we’ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
</ul>
@@ -841,8 +841,8 @@ DELETE 1
<h2 id="2017-01-02">2017-01-02</h2>
<ul>
<li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li>
-<li>I tested on DSpace Test as well and it doesn't work there either</li>
-<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li>
+<li>I tested on DSpace Test as well and it doesn’t work there either</li>
+<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years</li>
</ul>
@@ -863,8 +863,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
</code></pre><ul>
-<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li>
-<li>I've raised a ticket with Atmire to ask</li>
+<li>I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade</li>
+<li>I’ve raised a ticket with Atmire to ask</li>
<li>Another worrying error from dspace.log is:</li>
</ul>
@@ -877,7 +877,7 @@ DELETE 1
https://alanorth.github.io/cgspace-notes/2016-11/
<h2 id="2016-11-01">2016-11-01</h2>
<ul>
-<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
+<li>Add <code>dc.type</code> to the output options for Atmire’s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
</ul>
<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p>
@@ -897,7 +897,7 @@ DELETE 1
<li>ORCIDs plus normal authors</li>
</ul>
</li>
-<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
+<li>I exported a random item’s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre>
@@ -912,7 +912,7 @@ DELETE 1
<h2 id="2016-09-01">2016-09-01</h2>
<ul>
<li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li>
-<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li>
+<li>Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace</li>
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
@@ -972,7 +972,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<h2 id="2016-06-01">2016-06-01</h2>
<ul>
<li>Experimenting with IFPRI OAI (we want to harvest their publications)</li>
-<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
+<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI’s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
<li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li>
<li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc</a></li>
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
@@ -1007,8 +1007,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<ul>
<li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li>
<li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li>
-<li>After running DSpace for over five years I've never needed to look in any other log file than dspace.log, leave alone one from last year!</li>
-<li>This will save us a few gigs of backup space we're paying for on S3</li>
+<li>After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!</li>
+<li>This will save us a few gigs of backup space we’re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
@@ -1022,7 +1022,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<h2 id="2016-03-02">2016-03-02</h2>
<ul>
<li>Looking at issues with author authorities on CGSpace</li>
-<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li>
+<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module</li>
<li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li>
</ul>
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 8f6375717..931ad202b 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
April, 2019
@@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
March, 2019
2019-03-01
-- I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
+- I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
- I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
-- Looking at the other half of Udana's WLE records from 2018-11
+- Looking at the other half of Udana’s WLE records from 2018-11
- I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
- I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
@@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
February, 2019
@@ -213,7 +213,7 @@ sys 0m1.979s
January, 2019
@@ -221,7 +221,7 @@ sys 0m1.979s
2019-01-02
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
-- I don't see anything interesting in the web server logs around that time though:
+- I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -247,7 +247,7 @@ sys 0m1.979s
December, 2018
@@ -274,7 +274,7 @@ sys 0m1.979s
November, 2018
@@ -301,7 +301,7 @@ sys 0m1.979s
October, 2018
@@ -309,7 +309,7 @@ sys 0m1.979s
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
-- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
+- I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
Read more →
@@ -323,7 +323,7 @@ sys 0m1.979s
September, 2018
@@ -331,9 +331,9 @@ sys 0m1.979s
2018-09-02
- New PostgreSQL JDBC driver version 42.2.5
-- I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
-- Also, I'll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
-- I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
+- I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
+- Also, I’ll re-run the
postgresql
tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
+- I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
Read more →
@@ -347,7 +347,7 @@ sys 0m1.979s
August, 2018
@@ -361,10 +361,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat's
-- I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+- From the DSpace log I see that eventually Solr stopped responding, so I guess the
java
process that was OOM killed above was Tomcat’s
+- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
Read more →
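- A sketch of where that heap bump would go on an Ubuntu Tomcat 7 install (the file location and the other flags are assumptions):

```
# /etc/default/tomcat7 — raise the JVM heap from 5120m to 6144m
JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC"
```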
@@ -379,7 +379,7 @@ sys 0m1.979s
July, 2018
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 0b049ac8d..63c6e429f 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
June, 2018
@@ -104,7 +104,7 @@
- Test the DSpace 5.8 module upgrades from Atmire (#378)
-- There seems to be a problem with the CUA and L&R versions in
pom.xml
because they are using SNAPSHOT and it doesn't build
+- There seems to be a problem with the CUA and L&R versions in
pom.xml
because they are using SNAPSHOT and it doesn’t build
- I added the new CCAFS Phase II Project Tag
PII-FP1_PACCA2
and merged it into the 5_x-prod
branch (#379)
@@ -133,7 +133,7 @@ sys 2m7.289s
May, 2018
@@ -161,14 +161,14 @@ sys 2m7.289s
April, 2018
2018-04-01
-- I tried to test something on DSpace Test but noticed that it's down since god knows when
+- I tried to test something on DSpace Test but noticed that it’s down since god knows when
- Catalina logs at least show some memory errors yesterday:
Read more →
@@ -183,7 +183,7 @@ sys 2m7.289s
March, 2018
@@ -204,7 +204,7 @@ sys 2m7.289s
February, 2018
@@ -212,9 +212,9 @@ sys 2m7.289s
2018-02-01
- Peter gave feedback on the
dc.rights
proof of concept that I had sent him last week
-- We don't need to distinguish between internal and external works, so that makes it just a simple list
+- We don’t need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
-- I copied the logic in the
jmx_tomcat_dbpools
provided by Ubuntu's munin-plugins-java
package and used the stuff I discovered about JMX in 2018-01
+- I copied the logic in the
jmx_tomcat_dbpools
provided by Ubuntu’s munin-plugins-java
package and used the stuff I discovered about JMX in 2018-01
Read more →
@@ -228,7 +228,7 @@ sys 2m7.289s
January, 2018
@@ -236,7 +236,7 @@ sys 2m7.289s
2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
-- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
+- I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
- The nginx logs show HTTP 200s until
02/Jan/2018:11:27:17 +0000
when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
- And just before that I see this:
@@ -244,8 +244,8 @@ sys 2m7.289s
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
- Ah hah! So the pool was actually empty!
-- I need to increase that, let's try to bump it up from 50 to 75
-- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
+- I need to increase that, let’s try to bump it up from 50 to 75
+- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
- I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
-- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
+- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
Read more →
@@ -312,7 +312,7 @@ dspace.log.2018-01-02:34
December, 2017
@@ -336,7 +336,7 @@ dspace.log.2018-01-02:34
November, 2017
@@ -369,7 +369,7 @@ COPY 54701
October, 2017
@@ -380,7 +380,7 @@ COPY 54701
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-- There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+- There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
- Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
Read more →
@@ -395,10 +395,10 @@ COPY 54701
CGIAR Library Migration
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 8739170b4..e2c05d378 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@
September, 2017
@@ -106,7 +106,7 @@
2017-09-07
-- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
Read more →
@@ -121,7 +121,7 @@
August, 2017
@@ -139,7 +139,7 @@
- The
robots.txt
only blocks the top-level /discover
and /browse
URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-- It turns out that we're already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+- It turns out that we’re already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
@@ -160,7 +160,7 @@
July, 2017
@@ -171,8 +171,8 @@
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
-- Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-- We can use PostgreSQL's extended output format (
-x
) plus sed
to format the output into quasi XML:
+- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+- We can use PostgreSQL’s extended output format (
-x
) plus sed
to format the output into quasi XML:
Read more →
@@ -187,11 +187,11 @@
June, 2017
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
@@ -205,11 +205,11 @@
May, 2017
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
Read more →
@@ -223,7 +223,7 @@
April, 2017
@@ -252,7 +252,7 @@
March, 2017
@@ -270,7 +270,7 @@
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-- Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+- Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
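For what it's worth, the CMYK problem can be worked around when regenerating a thumbnail by forcing the colorspace during conversion. A rough sketch with ImageMagick (the input filename and output path are hypothetical, and a proper ICC profile would give more accurate colour than a bare conversion):

```
$ convert 10568-51999.pdf\[0\] -colorspace sRGB -flatten -thumbnail 600x600 /tmp/10568-51999-srgb.jpg
$ identify /tmp/10568-51999-srgb.jpg
```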
@@ -288,7 +288,7 @@
February, 2017
@@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
- Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-- Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
+- Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
Read more →
@@ -322,15 +322,15 @@ DELETE 1
January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
-- I tested on DSpace Test as well and it doesn't work there either
-- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
+- I tested on DSpace Test as well and it doesn’t work there either
+- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
Read more →
@@ -345,7 +345,7 @@ DELETE 1
December, 2016
@@ -360,8 +360,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-- I've raised a ticket with Atmire to ask
+- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+- I’ve raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
Read more →
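To back up the claim that there are thousands of these BatchEditConsumer warnings, a quick per-file count is enough; a sketch, assuming the standard dspace.log.YYYY-MM-DD naming in the DSpace log directory:

```
$ grep -c 'BatchEditConsumer should not have been given this kind of Subject' dspace.log.2016-1*
```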
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 4a5c3b17f..b145e5997 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -14,7 +14,7 @@
@@ -42,7 +42,7 @@
@@ -96,13 +96,13 @@
November, 2016
2016-11-01
-- Add dc.type to the output options for Atmire's Listings and Reports module (#286)
+- Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
[Image: Listings and Reports with output type]
Read more →
@@ -118,7 +118,7 @@
October, 2016
@@ -131,7 +131,7 @@
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
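For context, the resulting CSV for the batch metadata edit would look roughly like this (the id and collection values below are made-up placeholders; only the column layout and the ||-separated ORCIDs follow the note above):

```
$ cat > /tmp/orcid-test.csv <<'EOF'
id,collection,ORCID:dc.contributor.author
74321,10568/3030,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
EOF
```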
@@ -148,14 +148,14 @@
September, 2016
2016-09-01
- Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
-- Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
+- Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
- We had been using DC=ILRI to determine whether a user was ILRI or not
- It looks like we might be able to use OUs now, instead of DCs:
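A sketch of how one might inspect a user's DN to see the OU components that could replace the DC=ILRI check (the LDAP host, bind account, base DN, and username here are all placeholders, not our real Active Directory settings):

```
$ ldapsearch -x -H ldaps://ad.example.org -D 'svc-dspace@example.org' -W \
    -b 'dc=example,dc=org' '(sAMAccountName=someuser)' distinguishedName
```

If the returned distinguishedName contains a component like OU=ILRI, that could be matched instead of the domain component.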
@@ -174,7 +174,7 @@
August, 2016
@@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5
July, 2016
@@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
June, 2016
2016-06-01
- Experimenting with IFPRI OAI (we want to harvest their publications)
-- After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
+- After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
- After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
- This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
- You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
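Both OAI-PMH requests can be exercised from the command line; a sketch using curl and xmllint (my own commands, the URLs are the ones above):

```
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListSets' | xmllint --format - | grep '<setSpec>'
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc' | xmllint --format - | head -n 40
```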
@@ -261,7 +261,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
May, 2016
@@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
April, 2016
@@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
- We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
-- After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
-- This will save us a few gigs of backup space we're paying for on S3
+- After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!
+- This will save us a few gigs of backup space we’re paying for on S3
- Also, I noticed the checker log has some errors we should pay attention to:
Read more →
@@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
March, 2016
2016-03-02
- Looking at issues with author authorities on CGSpace
-- For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
+- For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
Read more →
@@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
February, 2016
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 9272c6fec..dcef508ce 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -14,7 +14,7 @@
@@ -42,7 +42,7 @@
@@ -96,7 +96,7 @@
January, 2016
@@ -119,7 +119,7 @@
December, 2015
@@ -146,7 +146,7 @@
November, 2015
diff --git a/docs/tags/index.html b/docs/tags/index.html
index 2401a8f18..8d653d639 100644
--- a/docs/tags/index.html
+++ b/docs/tags/index.html
@@ -14,7 +14,7 @@
@@ -42,7 +42,7 @@
@@ -95,7 +95,7 @@
January, 2020
@@ -132,7 +132,7 @@
December, 2019
@@ -164,7 +164,7 @@
November, 2019
@@ -183,7 +183,7 @@
1277694
- So 4.6 million from XMLUI and another 1.2 million from API requests
-- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
+- Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@@ -202,10 +202,10 @@
CGSpace CG Core v2 Migration
@@ -223,12 +223,12 @@
October, 2019
- 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
+ 2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc.
Read more →
@@ -241,7 +241,7 @@
September, 2019
@@ -286,14 +286,14 @@
August, 2019
2019-08-03
-- Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
+- Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
2019-08-04
@@ -301,7 +301,7 @@
- Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
-- After rebooting, all statistics cores were loaded… wow, that's lucky.
+- After rebooting, all statistics cores were loaded… wow, that’s lucky.
- Run system updates on DSpace Test (linode19) and reboot it
@@ -318,7 +318,7 @@
July, 2019
@@ -346,7 +346,7 @@
June, 2019
@@ -372,7 +372,7 @@
May, 2019
diff --git a/docs/tags/migration/index.html b/docs/tags/migration/index.html
index 7449b80c1..11d1f0eb0 100644
--- a/docs/tags/migration/index.html
+++ b/docs/tags/migration/index.html
@@ -14,7 +14,7 @@
@@ -28,7 +28,7 @@
@@ -80,10 +80,10 @@
CGSpace CG Core v2 Migration
@@ -101,10 +101,10 @@
CGIAR Library Migration
diff --git a/docs/tags/notes/index.html b/docs/tags/notes/index.html
index ff5f2be40..076a27ea8 100644
--- a/docs/tags/notes/index.html
+++ b/docs/tags/notes/index.html
@@ -14,7 +14,7 @@
-
+
@@ -28,7 +28,7 @@
-
+
@@ -81,7 +81,7 @@
September, 2017
@@ -91,7 +91,7 @@
2017-09-07
-- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
Read more →
@@ -106,7 +106,7 @@
August, 2017
@@ -124,7 +124,7 @@
- The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-- It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+- It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
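The X-Robots-Tag point above is easy to double check from a shell; a quick sketch against the test server (the path is just one of the Discovery URLs mentioned above). If the header comes back as "none" the page is excluded from indexing, but as noted, the bot still has to crawl it to see the header at all:

```
$ curl -s -I 'https://dspacetest.cgiar.org/discover' | grep -i 'x-robots-tag'
```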
@@ -145,7 +145,7 @@
July, 2017
@@ -156,8 +156,8 @@
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
-- Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-- We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
+- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+- We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
Read more →
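Roughly, that trick looks like this (a sketch only: the database name, the query, and the sed expression are placeholders to adapt, not the exact command that was used):

```
$ psql -x -U dspace -d mel_dspace -c 'SELECT element, qualifier, scope_note FROM metadatafieldregistry;' \
    | sed -E 's/^([a-z_]+) +\| (.*)$/<\1>\2<\/\1>/'
```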
@@ -172,11 +172,11 @@
June, 2017
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
@@ -190,11 +190,11 @@
May, 2017
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
Read more →
@@ -208,7 +208,7 @@
April, 2017
@@ -237,7 +237,7 @@
March, 2017
@@ -255,7 +255,7 @@
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-- Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+- Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -273,7 +273,7 @@
February, 2017
@@ -292,7 +292,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
- Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-- Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
+- Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
Read more →
@@ -307,15 +307,15 @@ DELETE 1
January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
-- I tested on DSpace Test as well and it doesn't work there either
-- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
+- I tested on DSpace Test as well and it doesn’t work there either
+- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
Read more →
@@ -330,7 +330,7 @@ DELETE 1
December, 2016
@@ -345,8 +345,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-- I've raised a ticket with Atmire to ask
+- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+- I’ve raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
Read more →
diff --git a/docs/tags/notes/index.xml b/docs/tags/notes/index.xml
index d1210445e..fb1a87cc0 100644
--- a/docs/tags/notes/index.xml
+++ b/docs/tags/notes/index.xml
@@ -23,7 +23,7 @@
</ul>
<h2 id="2017-09-07">2017-09-07</h2>
<ul>
-<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group</li>
+<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group</li>
</ul>
@@ -47,7 +47,7 @@
</li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs… we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
-<li>It turns out that we're already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
+<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
@@ -70,8 +70,8 @@
<h2 id="2017-07-04">2017-07-04</h2>
<ul>
<li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li>
-<li>Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace</li>
-<li>We can use PostgreSQL's extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
+<li>Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace</li>
+<li>We can use PostgreSQL’s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
@@ -81,7 +81,7 @@
Thu, 01 Jun 2017 10:14:52 +0300
https://alanorth.github.io/cgspace-notes/2017-06/
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
@@ -90,7 +90,7 @@
Mon, 01 May 2017 16:21:52 +0200
https://alanorth.github.io/cgspace-notes/2017-05/
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
@@ -133,7 +133,7 @@
<li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li>
<li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li>
<li>Discovered that the ImageMagick <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
-<li>Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
+<li>Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -161,7 +161,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
</code></pre><ul>
<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li>
-<li>Looks like we'll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
+<li>Looks like we’ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
</ul>
@@ -174,8 +174,8 @@ DELETE 1
<h2 id="2017-01-02">2017-01-02</h2>
<ul>
<li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li>
-<li>I tested on DSpace Test as well and it doesn't work there either</li>
-<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years</li>
+<li>I tested on DSpace Test as well and it doesn’t work there either</li>
+<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years</li>
</ul>
@@ -196,8 +196,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
</code></pre><ul>
-<li>I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade</li>
-<li>I've raised a ticket with Atmire to ask</li>
+<li>I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade</li>
+<li>I’ve raised a ticket with Atmire to ask</li>
<li>Another worrying error from dspace.log is:</li>
</ul>
@@ -210,7 +210,7 @@ DELETE 1
https://alanorth.github.io/cgspace-notes/2016-11/
<h2 id="2016-11-01">2016-11-01</h2>
<ul>
-<li>Add <code>dc.type</code> to the output options for Atmire's Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
+<li>Add <code>dc.type</code> to the output options for Atmire’s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
</ul>
<p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p>
@@ -230,7 +230,7 @@ DELETE 1
<li>ORCIDs plus normal authors</li>
</ul>
</li>
-<li>I exported a random item's metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
+<li>I exported a random item’s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre>
@@ -245,7 +245,7 @@ DELETE 1
<h2 id="2016-09-01">2016-09-01</h2>
<ul>
<li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li>
-<li>Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace</li>
+<li>Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace</li>
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
@@ -305,7 +305,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<h2 id="2016-06-01">2016-06-01</h2>
<ul>
<li>Experimenting with IFPRI OAI (we want to harvest their publications)</li>
-<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI's OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
+<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI’s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
<li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li>
<li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc</a></li>
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
@@ -340,8 +340,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<ul>
<li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li>
<li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li>
-<li>After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!</li>
-<li>This will save us a few gigs of backup space we're paying for on S3</li>
+<li>After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!</li>
+<li>This will save us a few gigs of backup space we’re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
@@ -355,7 +355,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<h2 id="2016-03-02">2016-03-02</h2>
<ul>
<li>Looking at issues with author authorities on CGSpace</li>
-<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module</li>
+<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module</li>
<li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li>
</ul>
diff --git a/docs/tags/notes/page/2/index.html b/docs/tags/notes/page/2/index.html
index ee46ae5f9..7e0f55936 100644
--- a/docs/tags/notes/page/2/index.html
+++ b/docs/tags/notes/page/2/index.html
@@ -14,7 +14,7 @@
@@ -28,7 +28,7 @@
@@ -81,13 +81,13 @@
November, 2016
2016-11-01
-- Add dc.type to the output options for Atmire's Listings and Reports module (#286)
+- Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
[Image: Listings and Reports with output type]
Read more →
@@ -103,7 +103,7 @@
October, 2016
@@ -116,7 +116,7 @@
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
@@ -133,14 +133,14 @@
September, 2016
2016-09-01
- Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
-- Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
+- Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
- We had been using DC=ILRI to determine whether a user was ILRI or not
- It looks like we might be able to use OUs now, instead of DCs:
@@ -159,7 +159,7 @@
August, 2016
@@ -189,7 +189,7 @@ $ git rebase -i dspace-5.5
July, 2016
@@ -220,14 +220,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
June, 2016
2016-06-01
- Experimenting with IFPRI OAI (we want to harvest their publications)
-- After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
+- After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
- After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
- This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
- You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
@@ -246,7 +246,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
May, 2016
@@ -272,7 +272,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
April, 2016
@@ -280,8 +280,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
- We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
-- After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
-- This will save us a few gigs of backup space we're paying for on S3
+- After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!
+- This will save us a few gigs of backup space we’re paying for on S3
- Also, I noticed the checker log has some errors we should pay attention to:
Read more →
@@ -297,14 +297,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
March, 2016
2016-03-02
- Looking at issues with author authorities on CGSpace
-- For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
+- For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
Read more →
@@ -320,7 +320,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
February, 2016
diff --git a/docs/tags/notes/page/3/index.html b/docs/tags/notes/page/3/index.html
index 2259afadf..a8d537694 100644
--- a/docs/tags/notes/page/3/index.html
+++ b/docs/tags/notes/page/3/index.html
@@ -14,7 +14,7 @@
-
+
@@ -28,7 +28,7 @@
-
+
@@ -81,7 +81,7 @@
January, 2016
@@ -104,7 +104,7 @@
December, 2015
@@ -131,7 +131,7 @@
November, 2015
diff --git a/docs/tags/page/2/index.html b/docs/tags/page/2/index.html
index 772ff68d2..9b88ad184 100644
--- a/docs/tags/page/2/index.html
+++ b/docs/tags/page/2/index.html
@@ -14,7 +14,7 @@
@@ -42,7 +42,7 @@
@@ -95,7 +95,7 @@
April, 2019
@@ -136,16 +136,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
March, 2019
2019-03-01
-- I checked IITA's 259 Feb 14 records from last month for duplicates using Atmire's Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
+- I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
- I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
-- Looking at the other half of Udana's WLE records from 2018-11
+- Looking at the other half of Udana’s WLE records from 2018-11
- I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
- I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
@@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
February, 2019
@@ -213,7 +213,7 @@ sys 0m1.979s
January, 2019
@@ -221,7 +221,7 @@ sys 0m1.979s
2019-01-02
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
-- I don't see anything interesting in the web server logs around that time though:
+- I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
@@ -247,7 +247,7 @@ sys 0m1.979s
December, 2018
@@ -274,7 +274,7 @@ sys 0m1.979s
November, 2018
@@ -301,7 +301,7 @@ sys 0m1.979s
October, 2018
@@ -309,7 +309,7 @@ sys 0m1.979s
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
-- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
+- I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
Read more →
@@ -323,7 +323,7 @@ sys 0m1.979s
September, 2018
@@ -331,9 +331,9 @@ sys 0m1.979s
2018-09-02
- New PostgreSQL JDBC driver version 42.2.5
-- I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
-- Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
-- I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
+- I’ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
+- Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
+- I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
Read more →
@@ -347,7 +347,7 @@ sys 0m1.979s
August, 2018
@@ -361,10 +361,10 @@ sys 0m1.979s
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
-- From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat's
-- I'm not sure why Tomcat didn't crash with an OutOfMemoryError…
+- From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
+- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
-- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
+- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
Read more →
@@ -379,7 +379,7 @@ sys 0m1.979s
July, 2018
diff --git a/docs/tags/page/3/index.html b/docs/tags/page/3/index.html
index 8f124aa5e..8829a83b3 100644
--- a/docs/tags/page/3/index.html
+++ b/docs/tags/page/3/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -95,7 +95,7 @@
June, 2018
@@ -104,7 +104,7 @@
- Test the DSpace 5.8 module upgrades from Atmire (#378)
-- There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn't build
+- There seems to be a problem with the CUA and L&R versions in pom.xml because they are using SNAPSHOT and it doesn’t build
- I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
@@ -133,7 +133,7 @@ sys 2m7.289s
May, 2018
@@ -161,14 +161,14 @@ sys 2m7.289s
April, 2018
2018-04-01
-- I tried to test something on DSpace Test but noticed that it's down since god knows when
+- I tried to test something on DSpace Test but noticed that it’s down since god knows when
- Catalina logs at least show some memory errors yesterday:
Read more →
@@ -183,7 +183,7 @@ sys 2m7.289s
March, 2018
@@ -204,7 +204,7 @@ sys 2m7.289s
February, 2018
@@ -212,9 +212,9 @@ sys 2m7.289s
2018-02-01
- Peter gave feedback on the dc.rights proof of concept that I had sent him last week
-- We don't need to distinguish between internal and external works, so that makes it just a simple list
+- We don’t need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
-- I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
+- I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
Read more →
@@ -228,7 +228,7 @@ sys 2m7.289s
January, 2018
@@ -236,7 +236,7 @@ sys 2m7.289s
2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
-- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
+- I didn’t get any load alerts from Linode and the REST and XMLUI logs don’t show anything out of the ordinary
- The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
- And just before that I see this:
@@ -244,8 +244,8 @@ sys 2m7.289s
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
- Ah hah! So the pool was actually empty!
-- I need to increase that, let's try to bump it up from 50 to 75
-- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
+- I need to increase that, let’s try to bump it up from 50 to 75
+- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
- I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
@@ -298,7 +298,7 @@ dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
-- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
+- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
Read more →
@@ -312,7 +312,7 @@ dspace.log.2018-01-02:34
December, 2017
@@ -336,7 +336,7 @@ dspace.log.2018-01-02:34
November, 2017
@@ -369,7 +369,7 @@ COPY 54701
October, 2017
@@ -380,7 +380,7 @@ COPY 54701
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-- There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+- There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
- Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
Read more →
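A hedged sketch of how to spot those doubled-up handle URLs in the database before deciding between SQL and OpenRefine (column names match the DSpace 5 metadatavalue table; the LIKE pattern is just an illustration, not the query that was actually run):

```
$ psql -U dspace -d dspace -c "SELECT resource_id, text_value FROM metadatavalue WHERE text_value LIKE '%hdl.handle.net%||%hdl.handle.net%';"
```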
@@ -395,10 +395,10 @@ COPY 54701
CGIAR Library Migration
diff --git a/docs/tags/page/4/index.html b/docs/tags/page/4/index.html
index 581514b3e..2a6800130 100644
--- a/docs/tags/page/4/index.html
+++ b/docs/tags/page/4/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,7 +96,7 @@
September, 2017
@@ -106,7 +106,7 @@
2017-09-07
-- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step as well as the group
+- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
Read more →
@@ -121,7 +121,7 @@
August, 2017
@@ -139,7 +139,7 @@
- The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
-- It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+- It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
@@ -160,7 +160,7 @@
July, 2017
@@ -171,8 +171,8 @@
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
-- Looking at extracting the metadata registries from ICARDA's MEL DSpace database so we can compare fields with CGSpace
-- We can use PostgreSQL's extended output format (-x) plus sed to format the output into quasi XML:
+- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
+- We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
Read more →
@@ -187,11 +187,11 @@
June, 2017
- 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we'll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
+ 2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
@@ -205,11 +205,11 @@
May, 2017
- 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
+ 2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
Read more →
@@ -223,7 +223,7 @@
April, 2017
@@ -252,7 +252,7 @@
March, 2017
@@ -270,7 +270,7 @@
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
-- Interestingly, it seems DSpace 4.x's thumbnails were sRGB, but forcing regeneration using DSpace 5.x's ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
+- Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
@@ -288,7 +288,7 @@
February, 2017
@@ -307,7 +307,7 @@ dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
- Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
-- Looks like we'll be using cg.identifier.ccafsprojectpii as the field name
+- Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
Read more →
@@ -322,15 +322,15 @@ DELETE 1
January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
-- I tested on DSpace Test as well and it doesn't work there either
-- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
+- I tested on DSpace Test as well and it doesn’t work there either
+- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
Read more →
@@ -345,7 +345,7 @@ DELETE 1
December, 2016
@@ -360,8 +360,8 @@ DELETE 1
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
-- I've raised a ticket with Atmire to ask
+- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
+- I’ve raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
Read more →
diff --git a/docs/tags/page/5/index.html b/docs/tags/page/5/index.html
index b1a9fa521..12cae9196 100644
--- a/docs/tags/page/5/index.html
+++ b/docs/tags/page/5/index.html
@@ -14,7 +14,7 @@
-
+
@@ -42,7 +42,7 @@
-
+
@@ -96,13 +96,13 @@
November, 2016
2016-11-01
-- Add dc.type to the output options for Atmire's Listings and Reports module (#286)
+- Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
[Image: Listings and Reports with output type]
Read more →
@@ -118,7 +118,7 @@
October, 2016
@@ -131,7 +131,7 @@
ORCIDs plus normal authors
-I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
+I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
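Put together, the CSV for the batch metadata edit looks roughly like this (the item id and collection handle below are hypothetical placeholders; only the ORCIDs are the test values above):

```
id,collection,ORCID:dc.contributor.author
12345,10568/579,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
```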
@@ -148,14 +148,14 @@
September, 2016
2016-09-01
- Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
-- Discuss how the migration of CGIAR's Active Directory to a flat structure will break our LDAP groups in DSpace
+- Discuss how the migration of CGIAR’s Active Directory to a flat structure will break our LDAP groups in DSpace
- We had been using DC=ILRI to determine whether a user was ILRI or not
- It looks like we might be able to use OUs now, instead of DCs:
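To illustrate the difference (these distinguished names are hypothetical, not the real CGIAR ones): before the flattening the centre was visible in the DC components of a user's DN, while afterwards only an OU carries that information:

```
# Hypothetical pre-migration DN: centre encoded in the DC components
CN=Jane Doe,OU=Users,DC=ILRI,DC=CGIARAD,DC=ORG
# Hypothetical flattened DN: centre only distinguishable by an OU
CN=Jane Doe,OU=ILRI,OU=CGIAR Users,DC=CGIARAD,DC=ORG
```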
@@ -174,7 +174,7 @@
August, 2016
@@ -204,7 +204,7 @@ $ git rebase -i dspace-5.5
July, 2016
@@ -235,14 +235,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
June, 2016
2016-06-01
- Experimenting with IFPRI OAI (we want to harvest their publications)
-- After reading the ContentDM documentation I found IFPRI's OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
+- After reading the ContentDM documentation I found IFPRI’s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
- After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
- This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc
- You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
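Those endpoints are easy to poke at from the command line, assuming curl and xmllint are available:

```
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListSets' | xmllint --format - | grep -A1 setSpec
$ curl -s 'http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&from=2016-01-01&set=p15738coll2&metadataPrefix=oai_dc' | xmllint --format - | less
```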
@@ -261,7 +261,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
May, 2016
@@ -287,7 +287,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
April, 2016
@@ -295,8 +295,8 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
- We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
-- After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
-- This will save us a few gigs of backup space we're paying for on S3
+- After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, let alone one from last year!
+- This will save us a few gigs of backup space we’re paying for on S3
- Also, I noticed the checker log has some errors we should pay attention to:
Read more →
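A sketch of how the backup could be scoped to just the DSpace logs rather than the whole log folder, assuming the AWS CLI is the tool doing the S3 sync (the bucket name and paths are placeholders):

```
$ aws s3 sync [dspace]/log/ s3://example-backup-bucket/cgspace/log/ --exclude '*' --include 'dspace.log.*'
```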
@@ -312,14 +312,14 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
March, 2016
2016-03-02
- Looking at issues with author authorities on CGSpace
-- For some reason we still have the index-lucene-update cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
+- For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
Read more →
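Before dropping the old task, a quick check of what actually references it in the crontab is cheap (a sketch; the crontab owner is assumed to be the dspace user):

```
$ sudo crontab -l -u dspace | grep -i index-lucene-update
```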
@@ -335,7 +335,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
February, 2016
diff --git a/docs/tags/page/6/index.html b/docs/tags/page/6/index.html
index 98533d0e8..15ddf498f 100644
--- a/docs/tags/page/6/index.html
+++ b/docs/tags/page/6/index.html
@@ -96,7 +96,7 @@
January, 2016
@@ -119,7 +119,7 @@
December, 2015
@@ -146,7 +146,7 @@
November, 2015
diff --git a/docs/webfonts/fa-brands-400.eot b/docs/webfonts/fa-brands-400.eot
new file mode 100644
index 000000000..baf40576d
Binary files /dev/null and b/docs/webfonts/fa-brands-400.eot differ
diff --git a/docs/webfonts/fa-brands-400.svg b/docs/webfonts/fa-brands-400.svg
new file mode 100644
index 000000000..843c1c785
--- /dev/null
+++ b/docs/webfonts/fa-brands-400.svg
@@ -0,0 +1,3535 @@
diff --git a/docs/webfonts/fa-brands-400.ttf b/docs/webfonts/fa-brands-400.ttf
new file mode 100644
index 000000000..991632871
Binary files /dev/null and b/docs/webfonts/fa-brands-400.ttf differ
diff --git a/docs/webfonts/fa-brands-400.woff b/docs/webfonts/fa-brands-400.woff
new file mode 100644
index 000000000..f9e3bcd00
Binary files /dev/null and b/docs/webfonts/fa-brands-400.woff differ
diff --git a/docs/webfonts/fa-brands-400.woff2 b/docs/webfonts/fa-brands-400.woff2
new file mode 100644
index 000000000..51c07aef3
Binary files /dev/null and b/docs/webfonts/fa-brands-400.woff2 differ
diff --git a/docs/webfonts/fa-regular-400.eot b/docs/webfonts/fa-regular-400.eot
new file mode 100644
index 000000000..04e25cbaa
Binary files /dev/null and b/docs/webfonts/fa-regular-400.eot differ
diff --git a/docs/webfonts/fa-regular-400.svg b/docs/webfonts/fa-regular-400.svg
new file mode 100644
index 000000000..f1f7e6cb0
--- /dev/null
+++ b/docs/webfonts/fa-regular-400.svg
@@ -0,0 +1,803 @@
diff --git a/docs/webfonts/fa-regular-400.ttf b/docs/webfonts/fa-regular-400.ttf
new file mode 100644
index 000000000..9c6249c02
Binary files /dev/null and b/docs/webfonts/fa-regular-400.ttf differ
diff --git a/docs/webfonts/fa-regular-400.woff b/docs/webfonts/fa-regular-400.woff
new file mode 100644
index 000000000..2873e4389
Binary files /dev/null and b/docs/webfonts/fa-regular-400.woff differ
diff --git a/docs/webfonts/fa-regular-400.woff2 b/docs/webfonts/fa-regular-400.woff2
new file mode 100644
index 000000000..a34bd6524
Binary files /dev/null and b/docs/webfonts/fa-regular-400.woff2 differ
diff --git a/docs/webfonts/fa-solid-900.eot b/docs/webfonts/fa-solid-900.eot
new file mode 100644
index 000000000..39716a7b0
Binary files /dev/null and b/docs/webfonts/fa-solid-900.eot differ
diff --git a/docs/webfonts/fa-solid-900.svg b/docs/webfonts/fa-solid-900.svg
new file mode 100644
index 000000000..cfd0e2f44
--- /dev/null
+++ b/docs/webfonts/fa-solid-900.svg
@@ -0,0 +1,4700 @@
diff --git a/docs/webfonts/fa-solid-900.ttf b/docs/webfonts/fa-solid-900.ttf
new file mode 100644
index 000000000..ac4baa21f
Binary files /dev/null and b/docs/webfonts/fa-solid-900.ttf differ
diff --git a/docs/webfonts/fa-solid-900.woff b/docs/webfonts/fa-solid-900.woff
new file mode 100644
index 000000000..23002f8a6
Binary files /dev/null and b/docs/webfonts/fa-solid-900.woff differ
diff --git a/docs/webfonts/fa-solid-900.woff2 b/docs/webfonts/fa-solid-900.woff2
new file mode 100644
index 000000000..b37f209d1
Binary files /dev/null and b/docs/webfonts/fa-solid-900.woff2 differ