diff --git a/content/posts/2021-09.md b/content/posts/2021-09.md
index c7b9b0671..8c7ab65d8 100644
--- a/content/posts/2021-09.md
+++ b/content/posts/2021-09.md
@@ -29,4 +29,46 @@ $ docker-compose build
 - Then run system updates and reboot the server
 - After the system came back up I started a fresh re-harvesting
+## 2021-09-07
+
+- Checking last month's Solr statistics to see if there are any new bots that I need to purge and add to the list
+  - 78.203.225.68 made 50,000 requests in one day in August using this user agent: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
+  - It's a fixed-line ISP in Montpellier according to AbuseIPDB.com and has not been flagged as abusive, so it must be some CGIAR SMO person doing web application harvesting from the browser
+  - 130.255.162.154 is in Sweden and made 46,000 requests in August using this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
+  - 35.174.144.154 is on Amazon and made 28,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
+  - 192.121.135.6 is in Sweden and made 9,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
+  - 185.38.40.66 is in Germany and made 6,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4`
+  - 3.225.28.105 is on Amazon and made 3,000 requests with this user agent: `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
+  - I also noticed that we still have tons (25,000) of requests from MSNbot using this normal-looking user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
+  - I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.
+  - I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
+  - While looking at the MSN requests I noticed tons of requests from other strange hosts with reverse DNS names like malta2095.startdedicated.com., astra5139.startdedicated.com., and many others
+  - They must be related, because I see them all using the exact same user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
+  - So these startdedicated.com hosts seem to be some Bing bot as well...
+- I extracted all the IPs and purged them using my `check-spider-ip-hits.sh` script
+  - In total I purged 225,000 hits...
+
+## 2021-09-12
+
+- Start a harvest on AReS
+
+## 2021-09-13
+
+- Mishell Portilla asked me about thumbnails on CGSpace being small
+  - For example, [10568/114576](https://cgspace.cgiar.org/handle/10568/114576) has a lot of white space on the left side
+  - I created a new thumbnail with vipsthumbnail:
+
+```console
+$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
+```
+
+- Looking at the PDF's metadata I see:
+  - Producer: iLovePDF
+  - Creator: Adobe InDesign 15.0 (Windows)
+  - Format: PDF-1.7
+- Eventually I should do more tests on this and perhaps file a bug with DSpace... (see the sketch below for one way to inspect the PDF metadata)
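One way to double-check those Producer, Creator, and format values from the command line is `pdfinfo` from poppler-utils. A minimal sketch, assuming poppler-utils is installed and reusing the example filename from the vipsthumbnail command above:

```console
$ pdfinfo ARRTB2020ST.pdf | grep -E '^(Creator|Producer|PDF version)'
```

pdfinfo dumps the whole document info dictionary, so the grep just narrows the output to the fields mentioned above.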
+- Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
+  - I told them I can give them access to DSpace Test and that we should have a meeting soon
+  - We need to figure out what controlled vocabularies they should use
+
diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html
index 3cb6f6312..c39a250c6 100644
--- a/docs/2015-11/index.html
+++ b/docs/2015-11/index.html
@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
"/>
-
+
@@ -126,7 +126,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
- For now I have increased the limit from 60 to 90, run updates, and rebooted the server
@@ -137,7 +137,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
- Getting emails from uptimeRobot and uptimeButler that it’s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors
- Looks like there are still a bunch of idle PostgreSQL connections:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
96
- For some reason the number of idle connections is very high since we upgraded to DSpace 5
@@ -147,7 +147,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
- Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config
- The OAI application requests stylesheets and javascript files with the path
/oai/static/css
, which gets matched here:
-# static assets we can load from the file system directly with nginx
+# static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
try_files $uri @tomcat;
...
@@ -158,21 +158,21 @@ location ~ /(themes|static|aspects/ReportingSuite) {
We simply need to add include extra-security.conf;
to the above location block (but research and test first)
We should add WOFF assets to the list of things to set expires for:
-location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
+location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
- We should also add
aspects/Statistics
to the location block for static assets (minus static
from above):
-location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
+location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
- Need to check
/about
on CGSpace, as it’s blank on my local test server and we might need to add something there
- CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):
-$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
+$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
93
- I looked closer at the idle connections and saw that many have been idle for hours (current time on server is
2015-11-25T20:20:42+0000
):
-$ psql -c 'SELECT * from pg_stat_activity;' | less -S
+$ psql -c 'SELECT * from pg_stat_activity;' | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
@@ -191,13 +191,13 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item
Not as bad for me, but still unsustainable if you have to get many:
-$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
+$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
8.415
- Monitoring e-mailed in the evening to say CGSpace was down
- Idle connections in PostgreSQL again:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
66
- At the time, the current DSpace pool size was 50…
@@ -208,14 +208,14 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
- Still more alerts that CGSpace has been up and down all day
- Current database settings for DSpace:
-db.maxconnections = 30
+db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
- And idle connections:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49
- Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace’s thirst can ever be quenched
diff --git a/docs/2015-12/index.html b/docs/2015-12/index.html
index 5831fb5b9..587298078 100644
--- a/docs/2015-12/index.html
+++ b/docs/2015-12/index.html
@@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
-
+
@@ -126,7 +126,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
- Replace
lzop
with xz
in log compression cron jobs on DSpace Test—it uses less space:
-
# cd /home/dspacetest.cgiar.org/log
+# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
@@ -137,20 +137,20 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
CGSpace went down again (due to PostgreSQL idle connections of course)
Current database settings for DSpace are db.maxconnections = 30
and db.maxidle = 8
, yet idle connections are exceeding this:
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
39
- I restarted PostgreSQL and Tomcat and it’s back
- On a related note of why CGSpace is so slow, I decided to finally try the
pgtune
script to tune the postgres settings:
-# apt-get install pgtune
+# apt-get install pgtune
# pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
# mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig
# mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
- It introduced the following new settings:
-default_statistics_target = 50
+default_statistics_target = 50
maintenance_work_mem = 480MB
constraint_exclusion = on
checkpoint_completion_target = 0.9
@@ -164,7 +164,7 @@ max_connections = 80
Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc
For what it’s worth, now the REST API should be faster (because of these PostgreSQL tweaks):
-$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
+$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.474
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
2.141
@@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
CGSpace very slow, and monitoring emailing me to say it’s down, even though I can load the page (very slowly)
Idle postgres connections look like this (with no change in DSpace db settings lately):
-$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
29
- I restarted Tomcat and postgres…
@@ -197,7 +197,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
- We weren’t out of heap yet, but it’s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it’s ok
- A possible side effect is that I see that the REST API is twice as fast for the request above now:
-$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
+$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.368
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.968
@@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
CGSpace has been up and down all day and REST API is completely unresponsive
PostgreSQL idle connections are currently:
-postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
+postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
28
- I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation
@@ -229,7 +229,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
- Atmire sent some fixes to DSpace’s REST API code that was leaving contexts open (causing the slow performance and database issues)
- After deploying the fix to CGSpace the REST API is consistently faster:
-$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
+$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.675
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.599
diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html
index 3b3c1261b..79b5b7694 100644
--- a/docs/2016-01/index.html
+++ b/docs/2016-01/index.html
@@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
-
+
diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html
index 6d6755b5a..0f22cef25 100644
--- a/docs/2016-02/index.html
+++ b/docs/2016-02/index.html
@@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)…
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
"/>
-
+
@@ -140,20 +140,20 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
Found a way to get items with null/empty metadata values from SQL
First, find the metadata_field_id
for the field you want from the metadatafieldregistry
table:
-dspacetest=# select * from metadatafieldregistry;
+dspacetest=# select * from metadatafieldregistry;
- In this case our country field is 78
- Now find all resources with type 2 (item) that have null/empty values for that field:
-dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
+dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
- Then you can find the handle that owns it from its
resource_id
:
-dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
+dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
- It’s 25 items so editing in the web UI is annoying, let’s try SQL!
-dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
+dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
- After that perhaps a regular
dspace index-discovery
(no -b) should suffice…
@@ -171,7 +171,7 @@ DELETE 25
- I need to start running DSpace in Mac OS X instead of a Linux VM
- Install PostgreSQL from homebrew, then configure and import CGSpace database dump:
-$ postgres -D /opt/brew/var/postgres
+$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
@@ -187,7 +187,7 @@ $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sq
- After building and running a
fresh_install
I symlinked the webapps into Tomcat’s webapps folder:
-$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
+$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
@@ -198,11 +198,11 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
Add CATALINA_OPTS in /opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh
, as this script is sourced by the catalina
startup script
For example:
-CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
+CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
- After verifying that the site is working, start a full index:
-$ ~/dspace/bin/dspace index-discovery -b
+$ ~/dspace/bin/dspace index-discovery -b
2016-02-08
- Finish cleaning up and importing ~400 DAGRIS items into CGSpace
@@ -216,7 +216,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
- Help Sisay with OpenRefine
- Enable HTTPS on DSpace Test using Let’s Encrypt:
-$ cd ~/src/git
+$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
@@ -231,15 +231,15 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
Getting more and more hangs on DSpace Test, seemingly random but also during CSV import
Logs don’t always show anything right when it fails, but eventually one of these appears:
-org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
+org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
- or
-Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
+Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
- Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:
-# free -m
+# free -m
total used free shared buffers cached
Mem: 3950 3902 48 9 37 1311
-/+ buffers/cache: 2552 1397
@@ -253,11 +253,11 @@ Swap: 255 57 198
There are 1200 records that have PDFs, and will need to be imported into CGSpace
I created a filename
column based on the dc.identifier.url
column using the following transform:
-value.split('/')[-1]
+value.split('/')[-1]
- Then I wrote a tool called
generate-thumbnails.py
to download the PDFs and generate thumbnails for them, for example:
-$ ./generate-thumbnails.py ciat-reports.csv
+$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
> Downloading 64661.pdf
> Creating thumbnail for 64661.pdf
@@ -278,13 +278,13 @@ Processing 64195.pdf
Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those
265 items have dirty, URL-encoded filenames:
-$ ls | grep -c -E "%"
+$ ls | grep -c -E "%"
265
- I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames
- This python2 snippet seems to work in the CLI, but not so well in OpenRefine:
-$ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
+$ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
- Merge pull requests for submission form theming (#178) and missing center subjects in XMLUI item views (#176)
@@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- Turns out OpenRefine has an unescape function!
-
value.unescape("url")
+value.unescape("url")
- This turns the URLs into human-readable versions that we can use as proper filenames
- Run web server and system updates on DSpace Test and reboot
@@ -316,7 +316,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- Turns out the “bug” in SAFBuilder isn’t a bug, it’s a feature that allows you to encode extra information like the destination bundle in the filename
- Also, it seems DSpace’s SAF import tool doesn’t like importing filenames that have accents in them:
-java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
+java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
- Need to rename files to have no accents or umlauts, etc…
- Useful custom text facet for URLs ending with “.pdf”:
value.endsWith(".pdf")
@@ -325,12 +325,12 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
- To change Spanish accents to ASCII in OpenRefine:
-
value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
+value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
- But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac
- On closer inspection, I can import files with the following names on Linux (DSpace Test):
-Bitstream: tést.pdf
+Bitstream: tést.pdf
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
@@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
- Looking at the filenames for the CIAT Reports, some have some really ugly characters, like:
'
or ,
or =
or [
or ]
or (
or )
or _.pdf
or ._
etc
- It’s tricky to parse those things in some programming languages so I’d rather just get rid of the weird stuff now in OpenRefine:
-value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
+value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
- Finally import the 1127 CIAT items into CGSpace: https://cgspace.cgiar.org/handle/10568/35710
- Re-deploy CGSpace with the Google Scholar fix, but I’m waiting on the Atmire fixes for now, as the branch history is ugly
diff --git a/docs/2016-03/index.html b/docs/2016-03/index.html
index ed99e8b29..f1dfaf0e2 100644
--- a/docs/2016-03/index.html
+++ b/docs/2016-03/index.html
@@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
-
+
@@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
- I identified one commit that causes the issue and let them know
- Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:
-Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
+Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
2016-03-08
- Add a few new filters to Atmire’s Listings and Reports module (#180)
@@ -175,7 +175,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
- Help Sisay with some PostgreSQL queries to clean up the incorrect
dc.contributor.corporateauthor
field
- I noticed that we have some weird values in
dc.language
:
-# select * from metadatavalue where metadata_field_id=37;
+# select * from metadatavalue where metadata_field_id=37;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
1942571 | 35342 | 37 | hi | | 1 | | -1 | 2
@@ -215,7 +215,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
- Command used:
-$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
+$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
- Also, it looks like adding
-sharpen 0x1.0
really improves the quality of the image for only a few KB
@@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
- Abenet is having problems saving group memberships, and she gets this error: https://gist.github.com/alanorth/87281c061c2de57b773e
-Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
+Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
- I can reproduce the same error on DSpace Test and on my Mac
- Looks to be an issue with the Atmire modules, I’ve submitted a ticket to their tracker.
diff --git a/docs/2016-04/index.html b/docs/2016-04/index.html
index e8ba1e05f..31d488aa9 100644
--- a/docs/2016-04/index.html
+++ b/docs/2016-04/index.html
@@ -32,7 +32,7 @@ After running DSpace for over five years I’ve never needed to look in any
This will save us a few gigs of backup space we’re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
-
+
@@ -126,7 +126,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
- This will save us a few gigs of backup space we’re paying for on S3
- Also, I noticed the
checker
log has some errors we should pay attention to:
-Run start time: 03/06/2016 04:00:22
+Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
at java.io.FileInputStream.open(Native Method)
@@ -158,7 +158,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
- Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don’t need!
-# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
+# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
@@ -171,7 +171,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
- A better way to move metadata on this scale is via SQL, for example
dc.type.output
→ dc.type
(their IDs in the metadatafieldregistry are 66 and 109, respectively):
-dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
+dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
- After that an
index-discovery -bf
is required
@@ -182,7 +182,7 @@ UPDATE 40852
- Write shell script to do the migration of fields: https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b
- Testing with a few fields it seems to work well:
-$ ./migrate-fields.sh
+$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
@@ -199,7 +199,7 @@ UPDATE 51258
Looking at the DOI issue reported by Leroy from CIAT a few weeks ago
It seems the dx.doi.org
URLs are much more proper in our repository!
-dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
+dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
count
-------
5638
@@ -221,7 +221,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
- Looking at quality of WLE data (
cg.subject.iwmi
) in SQL:
-dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
+dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
- Listings and Reports is still not returning reliable data for
dc.type
- I think we need to ask Atmire, as their documentation isn’t too clear on the format of the filter configs
@@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
- I decided to keep the set of subjects that had
FMD
and RANGELANDS
added, as that addition appears to have been requested, and might be the newer list
- I found 226 blank metadatavalues:
-dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
+dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
- I think we should delete them and do a full re-index:
-dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
+dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
- I deleted them on CGSpace but I’ll wait to do the re-index as we’re going to be doing one in a few days for the metadata changes anyways
@@ -281,7 +281,7 @@ DELETE 226
- Test metadata migration on local instance again:
-$ ./migrate-fields.sh
+$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40885
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
@@ -298,7 +298,7 @@ $ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dsp
- CGSpace was down but I’m not sure why, this was in
catalina.out
:
-Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
+Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at org.dspace.rest.Resource.processFinally(Resource.java:163)
@@ -328,7 +328,7 @@ javax.ws.rs.WebApplicationException
- Get handles for items that are using a given metadata field, ie
dc.Species.animal
(105):
-# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
+# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
handle
-------------
10568/10298
@@ -338,26 +338,26 @@ javax.ws.rs.WebApplicationException
- Delete metadata values for
dc.GRP
and dc.icsubject.icrafsubject
:
-# delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
+# delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
- They are old ICRAF fields and we haven’t used them since 2011 or so
- Also delete them from the metadata registry
- CGSpace went down again,
dspace.log
had this:
-2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
+2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
- I restarted Tomcat and PostgreSQL and now it’s back up
- I bet this is the same crash as yesterday, but I only saw the errors in
catalina.out
- Looks to be related to this, from
dspace.log
:
-2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
+2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
- We have 18,000 of these errors right now…
- Delete a few more old metadata values:
dc.Species.animal
, dc.type.journal
, and dc.publicationcategory
:
-# delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
+# delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;
@@ -369,7 +369,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
- Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server
- Field migration went well:
-$ ./migrate-fields.sh
+$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40909
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
@@ -387,7 +387,7 @@ UPDATE 46075
Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)
Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:
-$ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
+$ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
21252
- I found a recent discussion on the DSpace mailing list and I’ve asked for advice there
@@ -423,7 +423,7 @@ UPDATE 46075
- Looks like the last one was “down” from about four hours ago
- I think there must be something with this REST stuff:
-# grep -c "Aborting context in finally statement" dspace.log.2016-04-*
+# grep -c "Aborting context in finally statement" dspace.log.2016-04-*
dspace.log.2016-04-01:0
dspace.log.2016-04-02:0
dspace.log.2016-04-03:0
@@ -468,7 +468,7 @@ dspace.log.2016-04-27:7271
- Logs for today and yesterday have zero references to this REST error, so I’m going to open back up the REST API but log all requests
-location /rest {
+location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
}
diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html
index 12b0bfeee..071bef4f0 100644
--- a/docs/2016-05/index.html
+++ b/docs/2016-05/index.html
@@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
-
+
@@ -126,13 +126,13 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
I have blocked access to the API now
There are 3,000 IPs accessing the REST API in a 24-hour period!
-# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
+# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
- The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
- 100% of the requests coming from Ethiopia are like this and result in an HTTP 500:
-GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
+GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
- For now I’ll block just the Ethiopian IP
- The owner of that application has said that the
NaN
(not a number) is an error in his code and he’ll fix it
@@ -152,7 +152,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
- I will re-generate the Discovery indexes after re-deploying
- Testing
renew-letsencrypt.sh
script for nginx
-#!/usr/bin/env bash
+#!/usr/bin/env bash
readonly SERVICE_BIN=/usr/sbin/service
readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto
@@ -214,7 +214,7 @@ fi
After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:
-[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
+[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
- I’ve sent them a question about it
- A user mentioned having problems with uploading a 33 MB PDF
@@ -240,7 +240,7 @@ fi
- Found ~200 messed up CIAT values in
dc.publisher
:
-# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "% %";
+# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "% %";
2016-05-13
- More theorizing about CGcore
@@ -259,7 +259,7 @@ fi
- They have thumbnails on Flickr and elsewhere
- In OpenRefine I created a new
filename
column based on the thumbnail
column with the following GREL:
-if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
+if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
- Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
- So for the
hqdefault.jpg
ones I just take the UUID (-2) and use it as the filename
@@ -269,7 +269,7 @@ fi
- More quality control on
filename
field of CCAFS records to make processing in shell and SAFBuilder more reliable:
-
value.replace('_','').replace('-','')
+value.replace('_','').replace('-','')
- We need to hold off on moving
dc.Species
to cg.species
because it is only used for plants, and might be better to move it to something like cg.species.plant
- And
dc.identifier.fund
is MOSTLY used for CPWF project identifier but has some other sponsorship things
@@ -281,17 +281,17 @@ fi
-# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
+# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
2016-05-20
- More work on CCAFS Video and Images records
- For SAFBuilder we need to modify filename column to have the thumbnail bundle:
-value + "__bundle:THUMBNAIL"
+value + "__bundle:THUMBNAIL"
- Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:
-value.replace(/\u0081/,'')
+value.replace(/\u0081/,'')
- Write shell script to resize thumbnails with height larger than 400: https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256
- Upload 707 CCAFS records to DSpace Test
@@ -309,12 +309,12 @@ fi
- Export CCAFS video and image records from DSpace Test using the migrate option (
-m
):
-
$ mkdir ~/ccafs-images
+$ mkdir ~/ccafs-images
$ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
- And then import to CGSpace:
-$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
+$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
- But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
- I’m trying to do a Discovery index before messing with the authority index
@@ -322,19 +322,19 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
- Run system updates on DSpace Test, re-deploy code, and reboot the server
- Clean up and import ~200 CTA records to CGSpace via CSV like:
-$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
+$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
- Discovery indexing took a few hours for some reason, and after that I started the
index-authority
script
-$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
+$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
2016-05-31
- The
index-authority
script ran over night and was finished in the morning
- Hopefully this was because we haven’t been running it regularly and it will speed up next time
- I am running it again with a timer to see:
-$ time /home/cgspace.cgiar.org/bin/dspace index-authority
+$ time /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Cleaning the old index
diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html
index 4fee2b001..c965e7389 100644
--- a/docs/2016-06/index.html
+++ b/docs/2016-06/index.html
@@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
-
+
@@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
You can see the others by using the OAI ListSets
verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund
to cg.identifier.cpwfproject
and then the rest to dc.description.sponsorship
-dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
+dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
@@ -141,7 +141,7 @@ UPDATE 14
Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with cg.coverage.admin-unit
Seems that the Browse configuration in dspace.cfg
can’t handle the ‘-’ in the field name:
-webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
+webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
- But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error
- I’ve sent a message to the DSpace mailing list to ask about the Browse index definition
@@ -154,13 +154,13 @@ UPDATE 14
- Investigating the CCAFS authority issue, I exported the metadata for the Videos collection
- The top two authors are:
-CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
+CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
- So the only difference is the “confidence”
- Ok, well THAT is interesting:
-dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
+dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
@@ -180,7 +180,7 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
- And now an actually relevant example:
-dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
+dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
count
-------
707
@@ -194,14 +194,14 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
- Trying something experimental:
-dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
+dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
- And then re-indexing authority and Discovery…?
- After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet
- The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:
-webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
+webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
- That would only be for the “Browse by” function… so we’ll have to see what effect that has later
@@ -215,7 +215,7 @@ UPDATE 960
- Figured out how to export a list of the unique values from a metadata field ordered by count:
-dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
+dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
-
Identified the next round of fields to migrate:
@@ -244,7 +244,7 @@ UPDATE 960
- Looks like this is all we need: https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies
- I wrote an XPath expression to extract the ILRI subjects from
input-forms.xml
(from the xmlstarlet package):
-$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
+$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
- Write to Atmire about the use of
atmire.orcid.id
to see if we can change it
- Seems to be a virtual field that is queried from the authority cache… hmm
@@ -263,7 +263,7 @@ UPDATE 960
- It looks like the values are documented in
Choices.java
- Experiment with setting all 960 CCAFS author values to be 500:
-dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
+dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
@@ -320,7 +320,7 @@ UPDATE 960
- CGSpace’s HTTPS certificate expired last night and I didn’t notice, had to renew:
-# /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
+# /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
- I really need to fix that cron job…
@@ -328,7 +328,7 @@ UPDATE 960
- Run the replacements/deletes for
dc.description.sponsorship
(investors) on CGSpace:
-$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
+$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
- The scripts for this are here:
@@ -346,7 +346,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
- There are still ~97 fields that weren’t indicated to do anything
- After the above deletions and replacements I regenerated a CSV and sent it to Peter et al to have a look
-dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
+dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
- Re-evaluate
dc.contributor.corporate
and it seems we will move it to dc.contributor.author
as this is more in line with how editors are actually using it
@@ -354,7 +354,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
- Test run of
migrate-fields.sh
with the following re-mappings:
-72 55 #dc.source
+72 55 #dc.source
86 230 #cg.contributor.crp
91 211 #cg.contributor.affiliation
94 212 #cg.species
@@ -367,7 +367,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
- Run all cleanups and deletions of
dc.contributor.corporate
on CGSpace:
-$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
+$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
@@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
- Wow, there are 95 authors in the database who have ‘,’ at the end of their name:
-
# select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
+# select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
- We need to use something like this to fix them, need to write a proper regex later:
-# update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
+# update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html
index a4c816296..d37202bde 100644
--- a/docs/2016-07/index.html
+++ b/docs/2016-07/index.html
@@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
-
+
@@ -135,7 +135,7 @@ In this case the select query was showing 95 results before the update
Add dc.description.sponsorship
to Discovery sidebar facets and make investors clickable in item view (#232)
I think this query should find and replace all authors that have “,” at the end of their names:
-dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
+dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
text_value
@@ -158,7 +158,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
We really only need statistics
and authority
but meh
Fix metadata for species on DSpace Test:
-$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
+$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
- Will run later on CGSpace
- A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is “ungraded”
@@ -169,7 +169,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
- Delete 23 blank metadata values from CGSpace:
-
cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
+cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 23
- Complete phase three of metadata migration, for the following fields:
@@ -188,7 +188,7 @@ DELETE 23
- Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)
-$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
+$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
@@ -198,7 +198,7 @@ $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Dele
- Doing some author cleanups from Peter and Abenet:
-
$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
+$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu
2016-07-13
@@ -215,20 +215,20 @@ $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UT
- Add species and breed to the XMLUI item display
- CGSpace crashed late at night and the DSpace logs were showing:
-2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
+