
+++
date = "2017-09-07T16:54:52+07:00"
author = "Alan Orth"
title = "September, 2017"
tags = ["Notes"]
+++

2017-09-06

  • Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

2017-09-07

  • Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is both in the approvers step and in the group

2017-09-10

  • Delete 58 blank metadata values from the CGSpace database:
dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 58
  • I also ran it on DSpace Test because we'll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
  • Run system updates and restart DSpace Test
  • We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)
  • I still have the original data from the CGIAR Library so I've zipped it up and sent it off to linode18 for now
  • sha256sum of original-cgiar-library-6.6GB.tar.gz is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a
  • Start doing a test run of the CGIAR Library migration locally
  • Notes and todo checklist here for now: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c
  • Create pull request for Phase I and II changes to CCAFS Project Tags: #336
  • We've been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
  • There will need to be some metadata updates for that as well (though if I recall correctly it is only about seven records). I had made some notes about it in 2017-07, but I've asked for more clarification from Lili just in case
  • Looking at the DSpace logs to see if we've had a change in the "Cannot get a connection" errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
# grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
dspace.log.2017-09-04:17
dspace.log.2017-09-05:752
dspace.log.2017-09-06:0
dspace.log.2017-09-07:0
dspace.log.2017-09-08:10
dspace.log.2017-09-09:0
dspace.log.2017-09-10:0
  • Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I'm sure that helped
  • There are still some errors, though, so maybe I should bump the connection limit up a bit
  • I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we're currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system's PostgreSQL max_connections (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
  • I updated both CGSpace and DSpace Test to use these new settings (60 connections per web app and 183 for system PostgreSQL limit)
  • I'm expecting to see 0 connection errors for the next few months
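  • A minimal sketch of the sizing rule above, just to make the arithmetic explicit. The function and parameter names are mine; the `+ 3` is simply the constant from the formula in these notes:

```python
def postgres_max_connections(webapps: int, per_app: int, extra: int = 3) -> int:
    """Return webapps * per_app + extra, the rule of thumb used in these notes."""
    return webapps * per_app + extra


# Three web apps with 60 pooled connections each, as configured above.
print(postgres_max_connections(webapps=3, per_app=60))
```

  • With the previous setting of 40 connections per app the same rule would give 123.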

2017-09-11

2017-09-12

  • I was testing the METS XSD caching during AIP ingest but it doesn't seem to help actually
  • The import process takes the same amount of time with and without the caching
  • Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):
$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
  • Great TCP dump guide here: https://danielmiessler.com/study/tcpdump
  • The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation
  • I sent a message to the mailing list to see if anyone knows more about this
  • In looking at the tcpdump results I notice that there is an update check to the ehcache server on every iteration of the ingest loop, for example:
09:39:36.008956 IP 192.168.8.124.50515 > 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_144&platform=x86_64&tc-version=UNKNOWN&tc-product=Ehcache+Core+1.7.2&source=Ehcache+Core&uptime-secs=0&patch=UNKNOWN HTTP/1.1
  • Turns out this is a known issue and Ehcache has refused to make it opt-in: https://jira.terracotta.org/jira/browse/EHC-461
  • But we can disable it by adding an updateCheck="false" attribute to the main <ehcache> tag in dspace-services/src/main/resources/caching/ehcache-config.xml
  • After re-compiling and re-deploying DSpace I no longer see those update checks during item submission
  • I had a Skype call with Bram Luyten from Atmire to discuss various issues related to ORCID in DSpace
    • First, ORCID is deprecating their version 1 API (which DSpace uses) and in version 2 API they have removed the ability to search for users by name
    • The logic is that searching by name actually isn't very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names
    • Atmire's proposed integration would work by having users look up and add authors to the authority core directly using the ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)
    • Once the association between name and ORCID is made in the authority then it can be autocompleted in the lookup field
    • Ideally there could also be a user interface for cleanup and merging of authorities
    • He will prepare a quote for us, keeping in mind that this could be useful to contribute back to the community for a 5.x release
    • As far as exposing ORCIDs as flat metadata alongside all other metadata, he says this should be possible and will work on a quote for us
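  • Going back to the tcpdump filter used earlier in this entry: `tcp[32:4] = 0x47455420` compares four bytes at offset 32 from the start of the TCP header with the ASCII bytes for "GET ". The offset of 32 assumes a 20-byte TCP header plus 12 bytes of options, which matches the `[nop,nop,TS ...]` options in the captured packet. A rough Python sketch of the same check (the function name and sample segments are made up for illustration):

```python
import struct

# The constant from the filter, packed as four big-endian bytes.
GET_MAGIC = 0x47455420
print(struct.pack(">I", GET_MAGIC))


def is_http_get(tcp_segment: bytes, header_len: int = 32) -> bool:
    """Mimic tcp[32:4] = 0x47455420 on a raw TCP segment (header + payload)."""
    return tcp_segment[header_len:header_len + 4] == b"GET "


# 32 zero bytes stand in for a TCP header with timestamp options.
with_options = bytes(32) + b"GET /kit/reflector HTTP/1.1\r\n"
# With a bare 20-byte header the payload starts earlier, so the check misses it.
without_options = bytes(20) + b"GET / HTTP/1.1\r\n"
print(is_http_get(with_options), is_http_get(without_options))
```

  • The second case shows the limitation of a fixed offset: GET requests in segments without TCP options would not be captured by this filter.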

2017-09-13

  • Last night Linode sent an alert about CGSpace (linode18) that it has exceeded the outbound traffic rate threshold of 10Mb/s for the last two hours
  • I wonder what was going on, and looking into the nginx logs I think maybe it's OAI...
  • Here are yesterday's top ten IP addresses making requests to /oai:
# awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
      1 213.136.89.78
      1 66.249.66.90
      1 66.249.66.92
      3 68.180.229.31
      4 35.187.22.255
  13745 54.70.175.86
  15814 34.211.17.113
  15825 35.161.215.53
  16704 54.70.51.7
  • Compared to the previous day's logs it looks VERY high:
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
      1 207.46.13.39
      1 66.249.66.93
      2 66.249.66.91
      4 216.244.66.194
     14 66.249.66.90
  • The user agents for those top IPs are:
    • 54.70.175.86: API scraper
    • 34.211.17.113: API scraper
    • 35.161.215.53: API scraper
    • 54.70.51.7: API scraper
  • And this user agent has never been seen before today (or at least recently!):
# grep -c "API scraper" /var/log/nginx/oai.log
62088
# zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
/var/log/nginx/oai.log.13.gz:0
/var/log/nginx/oai.log.14.gz:0
/var/log/nginx/oai.log.15.gz:0
/var/log/nginx/oai.log.16.gz:0
/var/log/nginx/oai.log.17.gz:0
/var/log/nginx/oai.log.18.gz:0
/var/log/nginx/oai.log.19.gz:0
/var/log/nginx/oai.log.20.gz:0
/var/log/nginx/oai.log.21.gz:0
/var/log/nginx/oai.log.22.gz:0
/var/log/nginx/oai.log.23.gz:0
/var/log/nginx/oai.log.24.gz:0
/var/log/nginx/oai.log.25.gz:0
/var/log/nginx/oai.log.26.gz:0
/var/log/nginx/oai.log.27.gz:0
/var/log/nginx/oai.log.28.gz:0
/var/log/nginx/oai.log.29.gz:0
/var/log/nginx/oai.log.2.gz:0
/var/log/nginx/oai.log.30.gz:0
/var/log/nginx/oai.log.3.gz:0
/var/log/nginx/oai.log.4.gz:0
/var/log/nginx/oai.log.5.gz:0
/var/log/nginx/oai.log.6.gz:0
/var/log/nginx/oai.log.7.gz:0
/var/log/nginx/oai.log.8.gz:0
/var/log/nginx/oai.log.9.gz:0
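  • For reference, the `awk | sort | uniq -c` pipeline used above for the top client IPs can be sketched in Python like this. The log lines are made up for illustration, and I'm assuming the client IP is the first field, as in the awk command:

```python
from collections import Counter


def top_clients(log_lines, n=10):
    """Count requests per client IP, taken as the first whitespace-separated field."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    # most_common() lists the busiest clients first, the reverse of `sort -h | tail`.
    return counts.most_common(n)


# Made-up access log lines for illustration (192.0.2.0/24 is a documentation range).
sample = [
    '192.0.2.10 - - [12/Sep/2017:00:00:01 +0000] "GET /oai/request HTTP/1.1" 200 512',
    '192.0.2.20 - - [12/Sep/2017:00:00:02 +0000] "GET /oai/request HTTP/1.1" 200 512',
    '192.0.2.10 - - [12/Sep/2017:00:00:03 +0000] "GET /oai/request HTTP/1.1" 200 512',
]
print(top_clients(sample, n=1))

# Cross-check using the counts reported above for the four busiest IPs.
top_four = [13745, 15814, 15825, 16704]
print(sum(top_four))  # 62088, the same as the `grep -c "API scraper"` result
```

  • So the four busiest IPs account for every one of the "API scraper" requests in yesterday's OAI log.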
  • Some of these heavy users are also using XMLUI, and their user agent isn't matched by the Tomcat Session Crawler valve, so each request uses a different session
  • Yesterday alone the IP addresses using the API scraper user agent were responsible for nearly 16,000 sessions in XMLUI:
# grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
15924
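  • A rough Python equivalent of that grep pipeline, in case it needs to be repeated for other IPs. Note that, unlike the grep above, this sketch escapes the dots in the IP addresses and refuses to match inside longer numbers, so its count could differ slightly. The sample lines are made up and only loosely modeled on the DSpace log format:

```python
import re

SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32})")


def count_sessions(log_lines, ips):
    """Count distinct session IDs on lines that mention any of the given IPs."""
    # re.escape() makes the dots literal; the lookarounds stop 54.70.51.7
    # from matching inside 154.70.51.70.
    alternatives = "|".join(re.escape(ip) for ip in ips)
    ip_re = re.compile(r"(?<!\d)(?:" + alternatives + r")(?!\d)")
    sessions = set()
    for line in log_lines:
        if ip_re.search(line):
            sessions.update(SESSION_RE.findall(line))
    return len(sessions)


# Hypothetical log lines for illustration only.
sample_log = [
    "2017-09-12 00:00:01 INFO anonymous:session_id=" + "A" * 32 + ":ip_addr=54.70.51.7:view_item",
    "2017-09-12 00:00:02 INFO anonymous:session_id=" + "A" * 32 + ":ip_addr=54.70.51.7:view_item",
    "2017-09-12 00:00:03 INFO anonymous:session_id=" + "B" * 32 + ":ip_addr=35.161.215.53:view_item",
    "2017-09-12 00:00:04 INFO anonymous:session_id=" + "C" * 32 + ":ip_addr=154.70.51.70:view_item",
]
scraper_ips = ["54.70.51.7", "35.161.215.53", "34.211.17.113", "54.70.175.86"]
print(count_sessions(sample_log, scraper_ips))
```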
  • If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex
  • A search for "API scraper" user agent on Google returns a robots.txt with a comment that this is the Yewno bot: http://www.escholarship.org/robots.txt
  • Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:
WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
  • Looking at the spreadsheet with deletions and corrections that CCAFS sent last week
  • It appears they want to delete a lot of metadata, which I'm not sure they realize the implications of:
dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;                                                                                                                                                                                                                  
        text_value        | count                              
--------------------------+-------                             
 FP4_ClimateModels        |     6                              
 FP1_CSAEvidence          |     7                              
 SEA_UpscalingInnovation  |     7                              
 FP4_Baseline             |    69                              
 WA_Partnership           |     1                              
 WA_SciencePolicyExchange |     6                              
 SA_GHGMeasurement        |     2                              
 SA_CSV                   |     7                              
 EA_PAR                   |    18                              
 FP4_Livestock            |     7                              
 FP4_GenderPolicy         |     4                              
 FP2_CRMWestAfrica        |    12                              
 FP4_ClimateData          |    24                              
 FP4_CCPAG                |     2                              
 SEA_mitigationSAMPLES    |     2                              
 SA_Biodiversity          |     1                              
 FP4_PolicyEngagement     |    20                              
 FP3_Gender               |     9                              
 FP4_GenderToolbox        |     3                              
(19 rows)
  • I sent CCAFS people an email to ask if they really want to remove these 200+ tags
  • She responded yes, so I'll at least need to do these deletes in PostgreSQL:
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
DELETE 207
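  • As a sanity check, the per-tag counts from the select above add up to exactly the 207 rows deleted, and three of the requested tags have no rows at all. A small Python sketch of that check, using the values copied from the query and its output:

```python
requested = [
    "EA_PAR", "FP1_CSAEvidence", "FP2_CRMWestAfrica", "FP3_Gender",
    "FP4_Baseline", "FP4_CCPAG", "FP4_CCPG", "FP4_CIATLAM IMPACT",
    "FP4_ClimateData", "FP4_ClimateModels", "FP4_GenderPolicy",
    "FP4_GenderToolbox", "FP4_Livestock", "FP4_PolicyEngagement", "FP_GII",
    "SA_Biodiversity", "SA_CSV", "SA_GHGMeasurement", "SEA_mitigationSAMPLES",
    "SEA_UpscalingInnovation", "WA_Partnership", "WA_SciencePolicyExchange",
]

# Counts as returned by the select query above (19 rows).
counts = {
    "FP4_ClimateModels": 6, "FP1_CSAEvidence": 7, "SEA_UpscalingInnovation": 7,
    "FP4_Baseline": 69, "WA_Partnership": 1, "WA_SciencePolicyExchange": 6,
    "SA_GHGMeasurement": 2, "SA_CSV": 7, "EA_PAR": 18, "FP4_Livestock": 7,
    "FP4_GenderPolicy": 4, "FP2_CRMWestAfrica": 12, "FP4_ClimateData": 24,
    "FP4_CCPAG": 2, "SEA_mitigationSAMPLES": 2, "SA_Biodiversity": 1,
    "FP4_PolicyEngagement": 20, "FP3_Gender": 9, "FP4_GenderToolbox": 3,
}

total = sum(counts.values())
missing = sorted(set(requested) - set(counts))
print(total)    # matches DELETE 207
print(missing)  # tags in the request that matched no rows
```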
  • When we discussed this in late July there were some other renames they had requested, but I don't see them in the current spreadsheet so I will have to follow that up
  • I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet evolved organically rather than systematically!
  • The final list of corrections and deletes should therefore be:
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
  • Create and merge pull request to shut up the Ehcache update check (#337)
  • It looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0, though it only affects XMLUI: https://jira.duraspace.org/browse/DS-1492
  • I commented there suggesting that we disable it globally
  • I merged the changes to the CCAFS project tags (#336) but still need to finalize the metadata deletions/renames
  • I merged the CGIAR Library theme changes (#338) to the 5_x-prod branch in preparation for next week's migration
  • I emailed the Handle administrators (hdladmin@cnri.reston.va.us) to ask them what the process is for changing their prefix to be resolved by our resolver
  • They responded and said that they need email confirmation from the contact of record of the other prefix, so I should have the CGIAR System Organization people email them before I send the new sitebndl.zip
  • Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database
  • Here are all my distinct authority combinations in the database before:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
 text_value |              authority               | confidence 
------------+--------------------------------------+------------
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
 Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
 Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
 Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
(8 rows)
  • And then after adding a new item and selecting an existing "Orth, Alan" with an ORCID in the author lookup:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
 text_value |              authority               | confidence 
------------+--------------------------------------+------------
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
 Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
 Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
 Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
 Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
(9 rows)
  • It created a new authority... let's try to add another item and select the same existing author and see what happens in the database:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
 text_value |              authority               | confidence 
------------+--------------------------------------+------------
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
 Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
 Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
 Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
 Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
(9 rows)
  • No new one... so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
 text_value |              authority               | confidence 
------------+--------------------------------------+------------
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f |        600
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
 Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
 Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
 Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
 Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
(10 rows)
  • Shit, it created another authority! Let's try it again!
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';                                                                                             
 text_value |              authority               | confidence
------------+--------------------------------------+------------
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f |        600
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |        600
 Orth, Alan | 9aed566a-a248-4878-9577-0caedada43db |        600
 Orth, A.   | 1a1943a0-3f87-402f-9afe-e52fb46a513e |        600
 Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e |         -1
 Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |          0
 Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde |        600
 Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 |         -1
 Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 |        600
(11 rows)
  • It added another authority... surely this is not the desired behavior, or maybe we are not using this as intended?
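  • To put a number on the problem, here is a small Python sketch that counts distinct authority IDs per author name, using the rows from the last query above. The helper name is mine:

```python
from collections import defaultdict

# (text_value, authority, confidence) rows copied from the final query output.
rows = [
    ("Orth, Alan", "7c2bffb8-58c9-4bc8-b102-ebe8aec200ad", -1),
    ("Orth, Alan", "1a1943a0-3f87-402f-9afe-e52fb46a513e", 600),
    ("Orth, Alan", "d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f", 600),
    ("Orth, Alan", "7c2bffb8-58c9-4bc8-b102-ebe8aec200ad", 600),
    ("Orth, Alan", "9aed566a-a248-4878-9577-0caedada43db", 600),
    ("Orth, A.", "1a1943a0-3f87-402f-9afe-e52fb46a513e", 600),
    ("Orth, Alan", "1a1943a0-3f87-402f-9afe-e52fb46a513e", -1),
    ("Orth, Alan", "7c2bffb8-58c9-4bc8-b102-ebe8aec200ad", 0),
    ("Orth, Alan", "cb3aa5ae-906f-4902-97b1-2667cf148dde", 600),
    ("Orth, Alan", "0d575fa3-8ac4-4763-a90a-1248d4791793", -1),
    ("Orth, Alan", "67a9588f-d86a-4155-81a2-af457e9d13f9", 600),
]


def authorities_per_name(rows):
    """Map each author name to the number of distinct authority IDs it uses."""
    seen = defaultdict(set)
    for name, authority, _confidence in rows:
        seen[name].add(authority)
    return {name: len(ids) for name, ids in seen.items()}


print(authorities_per_name(rows))
```

  • That is seven different authority IDs for the single name "Orth, Alan", two of them created by the test submissions above.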

2017-09-14

  • Communicate with Handle.net admins to try to get some guidance about the 10947 prefix
  • Michael Marus is the contact for their prefix, but he has left CGIAR. As I actually have access to the CGIAR Library server, I think I can just generate a new sitebndl.zip file from their server and send it to Handle.net
  • Also, Handle.net says their prefix is up for annual renewal next month so we might want to just pay for it and take it over
  • CGSpace was very slow and Uptime Robot even said it was down at one time
  • I didn't see any abnormally high usage in the REST or OAI logs, but looking at Munin I see the average JVM usage was at 4.9GB and the heap is only 5GB (5120M), so I think it's just normal growing pains
  • Every few months I generally try to increase the JVM heap to be 512M higher than the average usage reported by Munin, so now I adjusted it to 5632M
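  • A sketch of that heap sizing habit in Python. Rounding up to a multiple of 512M and treating 1GB as 1024M are my assumptions here; the notes only state the 512M headroom and the resulting 5632M (which is also just the old 5120M heap plus 512M):

```python
import math


def next_heap_mb(avg_usage_mb: float, headroom_mb: int = 512, step_mb: int = 512) -> int:
    """Add headroom to the average usage, then round up to a multiple of step_mb."""
    return math.ceil((avg_usage_mb + headroom_mb) / step_mb) * step_mb


# Average JVM usage of 4.9GB as reported by Munin.
print(next_heap_mb(4.9 * 1024))
```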

2017-09-15

  • Apply CCAFS project tag corrections on CGSpace:
dspace=# \i /tmp/ccafs-projects.sql 
DELETE 5
UPDATE 4
UPDATE 1
DELETE 1
DELETE 207

2017-09-17

  • Create pull request for CGSpace to be able to resolve multiple handles (#339)
  • We still need to do the changes to config.dct and regenerate the sitebndl.zip to send to the Handle.net admins
  • According to this dspace-tech mailing list entry from 2011, we need to add the extra handle prefixes to config.dct like this:
"server_admins" = (
"300:0.NA/10568"
"300:0.NA/10947"
)

"replication_admins" = (
"300:0.NA/10568"
"300:0.NA/10947"
)

"backup_admins" = (
"300:0.NA/10568"
"300:0.NA/10947"
)
  • More work on the CGIAR Library migration test run locally, as I was having problems importing the last fourteen items from the CGIAR System Management Office community
  • The problem was that we remapped the items to new collections after the initial import, so the items were using the 10947 prefix but the community and collection were using 10568
  • I ended up having to read the AIP Backup and Restore documentation closely a few times and then explicitly preserve handles and ignore parents:
$ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 "$item"; done
  • Also, this was in replace mode (-r) rather than submit mode (-s), because submit mode always generated a new handle even if I told it not to!
  • I decided to start the import process in the evening rather than waiting for the morning, and right as the first community was finished importing I started seeing Timeout waiting for idle object errors
  • I had to cancel the import, clean up a bunch of database entries, increase the PostgreSQL max_connections as a precaution, restart PostgreSQL and Tomcat, and then finally completed the import

2017-09-18

  • I think we should force regeneration of all thumbnails in the CGIAR Library community, as their DSpace is version 1.7 and CGSpace is running DSpace 5.5, so the regenerated thumbnails should look much better
  • One item for comparison:

With original DSpace 1.7 thumbnail

After DSpace 5.5

  • Moved the CGIAR Library Migration notes to a page, [cgiar-library-migration]({{< relref "cgiar-library-migration.md" >}}), as there seems to be a bug with post slugs defined in frontmatter when you have a permalink scheme defined in config.toml (happens currently in Hugo 0.27.1 at least)