---
title: "October, 2018"
date: 2018-10-01T22:31:54+03:00
author: "Alan Orth"
---
2018-10-01
- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
- I created a GitHub issue to track this #389, because I'm super busy in Nairobi right now
2018-10-03
- I see Moayad was busy collecting item views and downloads from CGSpace yesterday:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
933 40.77.167.90
971 95.108.181.88
1043 41.204.190.40
1454 157.55.39.54
1538 207.46.13.69
1719 66.249.64.61
2048 50.116.102.77
4639 66.249.64.59
4736 35.237.175.180
150362 34.218.226.147
- Of those, about 20% were HTTP 500 responses (!):
$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
118927 200
31435 500
- I added Phil Thornton and Sonal Henson's ORCID identifiers to the controlled vocabulary for `cg.creator.orcid` and then re-generated the names using my resolve-orcids.py script:
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
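- For reference, the name resolution in resolve-orcids.py boils down to querying the ORCID public API for each identifier; here is a minimal sketch of that idea (the endpoint version and JSON field handling are assumptions, and the real script does more error handling):

```python
#!/usr/bin/env python3
# Minimal sketch: resolve ORCID iDs to display names using the public ORCID API.
# Assumes the v2.1 "person" endpoint; the real resolve-orcids.py may differ.
import requests

def resolve_name(orcid: str) -> str:
    url = f"https://pub.orcid.org/v2.1/{orcid}/person"
    response = requests.get(url, headers={"Accept": "application/json"})
    response.raise_for_status()
    name = response.json().get("name") or {}
    given = (name.get("given-names") or {}).get("value", "")
    family = (name.get("family-name") or {}).get("value", "")
    return f"{given} {family}".strip()

with open("2018-10-03-orcids.txt") as orcids, open("2018-10-03-names.txt", "w") as names:
    for line in orcids:
        orcid = line.strip()
        print(f"Looking up the names associated with ORCID iD: {orcid}")
        names.write(f"{resolve_name(orcid)}: {orcid}\n")
```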
- I found a new corner case error that I need to check, where the given and family names are deactivated:
Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
- It appears to be Jim Lorenzen... I need to check that later!
- I merged the changes to the `5_x-prod` branch (#390)
- Linode sent another alert about CPU usage on CGSpace (linode18) this evening
- It seems that Moayad is making quite a lot of requests today:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1594 157.55.39.160
1627 157.55.39.173
1774 136.243.6.84
4228 35.237.175.180
4497 70.32.83.92
4856 66.249.64.59
7120 50.116.102.77
12518 138.201.49.199
87646 34.218.226.147
111729 213.139.53.62
- But in super positive news, he says they are using my new dspace-statistics-api and it's MUCH faster than using Atmire CUA's internal "restlet" API
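- For anyone curious, querying the statistics API is just a plain HTTP GET; a sketch like this works (the base URL, route, and response fields here are illustrative assumptions, not the documented interface):

```python
# Sketch: fetch per-item view and download counts from the dspace-statistics-api.
# The deployment path and response shape below are assumptions for illustration.
import requests

BASE = "https://dspacetest.cgiar.org/rest/statistics"  # hypothetical deployment path

response = requests.get(f"{BASE}/items", params={"limit": 10, "page": 0})
response.raise_for_status()
for item in response.json().get("statistics", []):
    print(item["id"], item.get("views", 0), item.get("downloads", 0))
```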
- I don't recognize the `138.201.49.199` IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
- Suspiciously, it's only grabbing the CGIAR System Office community (handle prefix 10947):
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
7 GET /handle/10568
4186 GET /handle/10947
- The user agent is suspicious too:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
- It's clearly a bot and it's not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
- I looked in Solr's statistics core and these hits were actually all counted as `isBot:false` (of course)... hmmm
- I tagged all of Sonal and Phil's items with their ORCID identifiers on CGSpace using my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
- Where `2018-10-03-add-orcids.csv` contained:
dc.contributor.author,cg.creator.id
"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
"Henson, S.",Sonal Henson: 0000-0002-2002-5462
"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
2018-10-04
- Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)
- Every item has at least a `LICENSE` bundle, and some have a `THUMBNAIL` bundle, but the indexing code is specifically checking for downloads from the `ORIGINAL` bundle
  - 10568/97460 (100550): has a thumbnail bitstream
  - 10568/96112 (96736): has only a LICENSE bitstream
- I see there are other bundles we might need to pay attention to: `TEXT`, `LOGO-COLLECTION`, `LOGO-COMMUNITY`, etc...
- On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads
- So it's fixed, but I'm not sure why!
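- A quick way to sanity check which bundles an item's bitstreams actually live in is the DSpace 5 REST API, for example (a sketch, using the internal item IDs noted above):

```python
# Sketch: list each bitstream's bundle for the two items above via the DSpace REST API.
import requests

for item_id in (100550, 96736):  # internal item IDs from the notes above
    url = f"https://cgspace.cgiar.org/rest/items/{item_id}/bitstreams"
    for bitstream in requests.get(url).json():
        print(item_id, bitstream.get("bundleName"), bitstream.get("name"))
```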
- Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
# zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
- I found a logic error in the dspace-statistics-api `indexer.py` script that was causing item views to be inserted into downloads
- I tagged version 0.4.2 of the tool and redeployed it on CGSpace
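- The distinction the indexer has to make is between hits on the item itself and hits on its bitstreams in the Solr statistics core; conceptually it is something like this (a sketch using the stock DSpace type codes and field names, not the actual indexer code):

```python
# Sketch: in the Solr statistics core, item views are hits with type:2 (item),
# while downloads are hits with type:0 (bitstream) owned by the item and in the
# ORIGINAL bundle. The Solr URL is a local default, not CGSpace's actual one.
import requests

SOLR = "http://localhost:8080/solr/statistics/select"

def count(query: str) -> int:
    params = {"q": query, "fq": "isBot:false AND statistics_type:view", "rows": 0, "wt": "json"}
    return requests.get(SOLR, params=params).json()["response"]["numFound"]

item_id = 100550
views = count(f"type:2 AND id:{item_id}")
downloads = count(f"type:0 AND owningItem:{item_id} AND bundleName:ORIGINAL")
print(f"views: {views}, downloads: {downloads}")
```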
2018-10-05
- Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay's work plan
- We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms
- Add a link to AReS explorer to the CGSpace homepage introduction text
2018-10-06
- Follow up with AgriKnowledge about including Handle links (`dc.identifier.uri`) on their item pages
- In July, 2018 they had said their programmers would include the field in the next update of their website software
- CIMMYT's DSpace repository is now running DSpace 5.x!
- It's running OAI, but not REST, so I need to talk to Richard about that!
2018-10-08
- AgriKnowledge says they're going to add the `dc.identifier.uri` to their item view in November when they update their website software
2018-10-10
- Peter noticed that some recently added PDFs don't have thumbnails
- When I tried to force them to be generated I got an error that I've never seen before:
$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
- I see there was an update to Ubuntu's ImageMagick on 2018-10-05, so maybe something changed or broke?
- I get the same error when forcing `filter-media` to run on DSpace Test too, so it's gotta be an ImageMagick bug
- The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an Ubuntu Security Notice from 2018-10-04
- Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)
- I commented out the line that disables PDF thumbnails in `/etc/ImageMagick-6/policy.xml`:
<!--<policy domain="coder" rights="none" pattern="PDF" />-->
- This works, but I'm not sure what ImageMagick's long-term plan is if they are going to disable ALL image formats...
- I suppose I need to enable a workaround for this in Ansible?
2018-10-11
- I emailed DuraSpace to update our entry in their DSpace registry (the data was still on DSpace 3, JSPUI, etc)
- Generate a list of the top 1500 values for `dc.subject` so Sisay can start making a controlled vocabulary for it:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
- Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!
- Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format "handle:10568/80775" because I noticed that the Land Portal does this
- Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using `<meta>` tags in their page header, and using the "dct:identifier" property instead of "dc:identifier"
- I re-created my local DSpace database container using podman instead of Docker:
$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
- I tried to run Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository
- I can pull the `docker.bintray.io/jfrog/artifactory-oss:latest` image, but not start it
- I decided to use a Sonatype Nexus repository instead:
$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
- With a few changes to my local Maven `settings.xml` it is working well
- Generate a list of the top 10,000 authors for Peter Ballantyne to look through:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
COPY 10000
- CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections
- I decided to constrain the max height of these to 200px using CSS (#392)
2018-10-13
- Run all system updates on DSpace Test (linode19) and reboot it
- Look through Peter's list of 746 author corrections in OpenRefine
- I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:
or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b4.*/))
)
- Then I exported and applied them on my local test server:
$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
- I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay's author controlled vocabulary
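- For context, all fix-metadata-values.py really does is run one UPDATE per CSV row; a trimmed-down sketch (assuming the CSV columns match the -f and -t flags, and leaving out the script's logging and safety checks):

```python
# Sketch of the fix-metadata-values logic: replace each old author text_value
# with the CORRECT one. Field ID 3 is dc.contributor.author (the -m 3 flag).
import csv
import psycopg2

connection = psycopg2.connect("dbname=dspace user=dspace password=fuuu host=localhost")
cursor = connection.cursor()

with open("2018-10-11-top-authors.csv") as f:
    for row in csv.DictReader(f):
        old, new = row["dc.contributor.author"], row["CORRECT"]
        if not new or new == old:
            continue
        cursor.execute(
            "UPDATE metadatavalue SET text_value=%s "
            "WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value=%s",
            (new, old),
        )
        print(f"Fixed {cursor.rowcount} occurrences of: {old}")

connection.commit()
```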
2018-10-14
- Merge the authors controlled vocabulary (#393), usage rights (#394), and the upstream DSpace 5.x cherry-picks (#394) into our `5_x-prod` branch
- Switch to the new CGIAR LDAP server on CGSpace, as it's been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
- Apply Peter's 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
- Run all system updates on CGSpace (linode18) and reboot the server
- After rebooting the server I noticed that Handles are not resolving, and the `dspace-handle-server` systemd service is not running (or rather, it exited with success)
- Restarting the service with systemd works for a few seconds, then the java process quits
- I suspect that the systemd service type needs to be `forking` rather than `simple`, because the service calls the default DSpace `start-handle-server` shell script, which uses `nohup` and `&` to background the java process
- It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting
- Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page's header, rather than the HTML properties they are using in their body
- Peter pointed out that some thumbnails were still not getting generated
- When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week... WTF?
- Looks like I can use `/usr/share/ghostscript/current` instead of `/usr/share/ghostscript/9.25`...
- I limited the tall thumbnails even further to 170px because Peter said CTA's were still too tall at 200px (#396)
2018-10-15
- Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications
- I don't see anything in the Catalina logs or `dmesg`, and the Tomcat manager shows XMLUI, REST, OAI, etc all "Running: false"
- I fixed it so that I could reindex, but I guess the rest of DSpace actually didn't start up...
- Create an account on DSpace Test for Felix from Earlham so he can test COPO submission
- I created a new collection and added him as the administrator so he can test submission
- He said he actually wants to test creation of communities, collections, etc, so I had to make him a super admin for now
- I told him we need to think about the workflow more seriously in the future
- I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:
$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
2018-10-16
- Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:
dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
- Talking to the CodeObia guys about the REST API, I started to wonder why it's so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
- Interestingly, the speed doesn't get better after you request the same thing multiple times–it's consistently bad on both CGSpace and DSpace Test!
$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
0.27s user 0.06s system 1% cpu 27.858 total
0.20s user 0.05s system 1% cpu 23.838 total
0.30s user 0.05s system 1% cpu 24.301 total
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.22s user 0.03s system 1% cpu 17.248 total
0.23s user 0.02s system 1% cpu 16.856 total
0.23s user 0.04s system 1% cpu 16.460 total
0.24s user 0.04s system 1% cpu 21.043 total
0.22s user 0.04s system 1% cpu 17.132 total
- I should note that at this time CGSpace is using Oracle Java and DSpace Test is using OpenJDK (both version 8)
- I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?
- I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
0.24s user 0.02s system 1% cpu 22.496 total
0.22s user 0.03s system 1% cpu 22.720 total
0.23s user 0.03s system 1% cpu 22.632 total
- If I make a request without the expands it is ten times faster:
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
0.21s user 0.05s system 9% cpu 2.787 total
0.23s user 0.02s system 8% cpu 2.896 total
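- To quantify it a bit more systematically than eyeballing the time output, something like this averages a handful of requests (a sketch):

```python
# Sketch: average the response time of the REST items endpoint over a few requests.
import time
import requests

URL = "https://dspacetest.cgiar.org/rest/items"
PARAMS = {"expand": "metadata,bitstreams,parentCommunityList", "limit": 100, "offset": 0}

timings = []
for _ in range(5):
    start = time.monotonic()
    requests.get(URL, params=PARAMS).raise_for_status()
    timings.append(time.monotonic() - start)

print(f"mean: {sum(timings) / len(timings):.1f}s min: {min(timings):.1f}s max: {max(timings):.1f}s")
```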
- I sent a mail to dspace-tech to ask how to profile this...
2018-10-17
- I decided to update most of the existing metadata values that we have in `dc.rights` on CGSpace to be machine readable in SPDX format (with the Creative Commons version if it was included)
- Most of them are from Bioversity, and I asked Maria for permission before updating them
- I manually went through and looked at the existing values and updated them in several batches:
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%/by/%' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-2.5' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%2.5%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%4.0%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution-NonCommercial-ShareAlike%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-NC-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution %';
UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
- I updated the fields on CGSpace and then started a re-index of Discovery
- We also need to re-think the `dc.rights` field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)
- Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server
- IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my resolve-orcids.py script, and regenerated the controlled vocabulary:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-10-17-orcids.txt
$ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
- I also decided to add the ORCID identifiers that MEL had sent us a few months ago...
- One problem I had with the `resolve-orcids.py` script is that one user seems to have disabled their profile data since we last updated:
Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
- So I need to handle that situation in the script for sure, but I'm not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?
- I made a pull request and merged the ORCID updates into the `5_x-prod` branch (#397)
- Improve the logic of name checking in my resolve-orcids.py script
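- One possible approach is to keep the existing name from the controlled vocabulary whenever the ORCID API returns deactivated or empty names; a sketch (the "Deactivated" strings are exactly what the API returned in the log above):

```python
# Sketch: keep the existing controlled vocabulary name when an ORCID profile's
# names come back as "Given Names Deactivated" / "Family Name Deactivated".
import requests

def resolve_name_safely(orcid: str, current_name: str) -> str:
    url = f"https://pub.orcid.org/v2.1/{orcid}/person"
    name = requests.get(url, headers={"Accept": "application/json"}).json().get("name") or {}
    given = (name.get("given-names") or {}).get("value", "")
    family = (name.get("family-name") or {}).get("value", "")
    if (not given and not family) or "Deactivated" in f"{given} {family}":
        print(f"Keeping existing name for deactivated profile: {orcid}")
        return current_name
    return f"{given} {family}".strip()
```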
2018-10-18
- I granted MEL's deposit user admin access to IITA, CIP, Bioversity, and RTB communities on DSpace Test so they can start testing real depositing
- After they do some tests and we check the values Enrico will send a formal email to Peter et al to ask that they start depositing officially
- I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually migrate from 9.5 to 9.6:
# su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
$ exit
# systemctl start postgresql
# dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
2018-10-19
- Help Francesca from Bioversity generate a report about items they uploaded in 2015 through 2018
- Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon
- Looking at the nginx logs around that time I see the following IPs making the most requests:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
361 207.46.13.179
395 181.115.248.74
485 66.249.64.93
535 157.55.39.213
536 157.55.39.99
551 34.218.226.147
580 157.55.39.173
1516 35.237.175.180
1629 66.249.64.91
1758 5.9.6.51
- 5.9.6.51 is MegaIndex, which I've seen before...
2018-10-20
- I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace's Solr configuration is for 4.9
- This means our existing Solr configuration doesn't run in Solr 5.5:
$ sudo docker pull solr:5
$ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
$ sudo docker logs my_solr
...
ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
- Apparently a bunch of legacy field types were removed in Solr 5
- So for now it's actually a huge pain in the ass to run the tests for my dspace-statistics-api
- Linode sent a message that the CPU usage was high on CGSpace (linode18) last night
- According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
249 207.46.13.179
250 157.55.39.173
301 54.166.207.223
303 157.55.39.213
310 66.249.64.95
362 34.218.226.147
381 66.249.64.93
415 35.237.175.180
1205 66.249.64.91
1227 5.9.6.51
- This bot is only using the XMLUI and it does not seem to be re-using its sessions:
# grep -c 5.9.6.51 /var/log/nginx/*.log
/var/log/nginx/access.log:9323
/var/log/nginx/error.log:0
/var/log/nginx/library-access.log:0
/var/log/nginx/oai.log:0
/var/log/nginx/rest.log:0
/var/log/nginx/statistics.log:0
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
8915
- Last month I added "crawl" to the Tomcat Crawler Session Manager Valve's regular expression matching, and it seems to be working for MegaIndex's user agent:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
- So I'm not sure why this bot uses so many sessions — is it because it requests very slowly?
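- A slightly more direct way to see how many distinct sessions each IP gets is to collect the session IDs from dspace.log; a sketch, assuming the session_id=...:ip_addr=... format grepped above:

```python
# Sketch: count distinct DSpace session IDs per client IP in dspace.log,
# assuming the "session_id=XXXX:ip_addr=Y.Y.Y.Y" format used in the grep above.
import re
from collections import defaultdict

pattern = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=(\d+\.\d+\.\d+\.\d+)")
sessions = defaultdict(set)

with open("dspace.log.2018-10-20") as log:
    for line in log:
        for session_id, ip in pattern.findall(line):
            sessions[ip].add(session_id)

for ip, ids in sorted(sessions.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]:
    print(f"{len(ids)}\t{ip}")
```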
2018-10-21
- Discuss AfricaRice joining CGSpace
2018-10-22
- Post message to Yammer about usage rights (dc.rights)
- Change `build.properties` to use HTTPS for Handles in our Ansible infrastructure playbooks
- We will still need to do a batch update of the `dc.identifier.uri` and other fields in the database:
dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
- While I was doing that I found two items using CGSpace URLs instead of handles in their `dc.identifier.uri` so I corrected those
- I also found several items that had invalid characters or multiple Handles in some related URL field like `cg.link.reference` so I corrected those too
- Improve the usage rights on the submission form by adding a default selection with no value as well as a better hint to look for the CC license on the publisher page or in the PDF (#398)
- I deployed the changes on CGSpace, ran all system updates, and rebooted the server
- Also, I updated all Handles in the database to use HTTPS:
dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
UPDATE 76608
- Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem
- Help CGSpace users with some issues related to usage rights
2018-10-23
- Improve the usage rights (dc.rights) on CGSpace again by adding the long names in the submission form, as well as adding version 3.0 and the Creative Commons Zero (CC0) public domain license (#399)
- Add "usage rights" to the XMLUI item display (#400)
- I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace
- Testing REST login and logout via httpie because Felix from Earlham says he's having issues:
$ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
acef8a4a-41f3-4392-b870-e873790f696b
$ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
- Also works via curl (login, check status, logout, check status):
$ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
$ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
{"okay":true,"authenticated":true,"email":"testdeposit@cgiar.org","fullname":"Test deposit","token":"e09fb5e1-72b0-4811-a2e5-5c1cd78293cc"}
$ curl -X POST -H "Content-Type: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/logout
$ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
{"okay":true,"authenticated":false,"email":null,"fullname":null,"token":null}%
- Improve the documentation of my dspace-statistics-api
- Email Modi and Jayashree from ICRISAT to ask if they want to join CGSpace as partners
2018-10-24
- I deployed the new Creative Commons choices to the usage rights on the CGSpace submission form
- Also, I deployed the changes to show usage rights on the item view
- Re-work the dspace-statistics-api to use Python's native json instead of ujson to make it easier to deploy in places where we don't have — or don't want to have — Python headers and a compiler (like containers)
- Re-work the deployment of the API to use systemd's `EnvironmentFile` to read the environment variables instead of `Environment` in the RMG Ansible infrastructure scripts
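- The nice thing about EnvironmentFile is that the application code doesn't change at all: the API just keeps reading its settings from the process environment, roughly like this (the variable names and defaults are illustrative, not necessarily the real ones):

```python
# Sketch: reading configuration from environment variables, so systemd can supply
# them via Environment= lines or an EnvironmentFile= without any code changes.
# The variable names and defaults here are illustrative.
import os

SOLR_SERVER = os.environ.get("SOLR_SERVER", "http://localhost:8080/solr")
DATABASE_NAME = os.environ.get("DATABASE_NAME", "dspacestatistics")
DATABASE_USER = os.environ.get("DATABASE_USER", "dspacestatistics")
DATABASE_PASS = os.environ.get("DATABASE_PASS", "")
DATABASE_HOST = os.environ.get("DATABASE_HOST", "localhost")
```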
2018-10-25
- Send Peter and Jane a list of technical ToRs for AReS open source work:
- Basic version of AReS that works with metadata fields present in default DSpace 5.x/6.x (for example author, date issued, type, subjects)
- Ability to harvest multiple repositories
- Configurable list of extra fields to harvest, per repository
- Configurable list of field and value mappings for consistent display/search with multiple repositories
- Configurable list of graphs/blocks to display on homepage
- Optional harvesting of DSpace view/download statistics if dspace-statistics-api is available on repository
- Optional harvesting of Altmetric mentions
- Configurable scheduling of harvesting (daily, weekly, etc)
- High-quality README.md on GitHub with description, requirements, deployment instructions, and license (GPLv3 unless ICARDA has a problem with that)
- Maria asked if we can add publisher (`dc.publisher`) to the advanced search filters, so I created a GitHub issue to track it
2018-10-28
- I forked the SolrClient library and updated its kazoo dependency to be version 2.5.0 so we stop getting errors about "async" being a reserved keyword in Python 3.7
- Then I re-generated the `requirements.txt` in the dspace-statistics-api and released version 0.5.2
- Then I re-deployed the API on DSpace Test, ran all system updates on the server, and rebooted it
- I tested my hack of depositing to one collection where the default item and bitstream READ policies are restricted and then mapping the item to another collection, but the item retains its default policies so Anonymous cannot see them in the mapped collection either
- Perhaps we need to try moving the item and inheriting the target collection's policies?
- I merged the changes for adding publisher (`dc.publisher`) to the advanced search to the `5_x-prod` branch (#402)
- I merged the changes for adding versionless Creative Commons licenses to the submission form to the `5_x-prod` branch (#403)
- I will deploy them later this week
2018-10-29
- I deployed the publisher and Creative Commons changes to CGSpace, ran all system updates, and rebooted the server
- I sent the email to Jane Poole and ILRI ICT and Finance to start the admin process of getting a new Linode server for AReS
2018-10-30
- Meet with the COPO guys to walk them through the CGSpace submission workflow and discuss CG core, REST API, etc
- I suggested that they look into submitting via the SWORDv2 protocol because it respects the workflows
- They said that they're not too worried about the hierarchical CG core schema, that they would just flatten metadata like affiliations when depositing to a DSpace repository
- I said that it might be time to engage the DSpace community to add support for more advanced schemas in DSpace 7+ (perhaps partnership with Atmire?)
2018-10-31
- More discussion and planning for AReS open sourcing and Amman meeting in 2019-10
- I did some work to clean up and improve the dspace-statistics-api README.md and project structure and moved it to the ILRI organization on GitHub
- Now the API serves some basic documentation on the root route
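- Since the API is built on Falcon, serving the documentation on the root route is just another resource; roughly like this (a sketch, not the actual handler):

```python
# Sketch: a Falcon resource that serves basic HTML documentation on the root route.
# The /items route mentioned in the HTML is an assumption for illustration.
import falcon

class RootResource:
    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        resp.content_type = "text/html"
        resp.body = '<p>See <a href="/items">/items</a> for view and download statistics.</p>'

api = falcon.API()
api.add_route("/", RootResource())
```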
- I want to announce it to the dspace-tech mailing list soon