CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

October, 2017

2017-10-01

http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
  • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
  • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections

2017-10-02

  • Peter Ballantyne said he was having problems logging into CGSpace with “both” of his accounts (CGIAR LDAP and personal, apparently)
  • I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a “no DN found” error:
2017-10-01 20:24:57,928 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
  • I thought maybe his account had expired (seeing as it’s was the first of the month) but he says he was finally able to log in today
  • The logs for yesterday show fourteen errors related to LDAP auth failures:
$ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
14
  • For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server
  • Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks

2017-10-04

  • Twice in the last twenty-four hours Linode has alerted about high CPU usage on CGSpace (linode2533629)
  • Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace
  • The first is a link to a browse page that should be handled better in nginx:
http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
  • We’ll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn’t exist in Discovery yet, but we’ll need to add it) as we have moved their poorly curated subjects from dc.subject to cg.subject.system
  • The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead
  • Help Sisay proof sixty-two IITA records on DSpace Test
  • Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries
  • Merge the Discovery search changes for ISI Journal (#341)

2017-10-05

  • Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold
  • I had a look at yesterday’s OAI and REST logs in /var/log/nginx but didn’t see anything unusual:
# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    141 157.55.39.240
    145 40.77.167.85
    162 66.249.66.92
    181 66.249.66.95
    211 66.249.66.91
    312 66.249.66.94
    384 66.249.66.90
   1495 50.116.102.77
   3904 70.32.83.92
   9904 45.5.184.196
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
      5 66.249.66.71
      6 66.249.66.67
      6 68.180.229.31
      8 41.84.227.85
      8 66.249.66.92
     17 66.249.66.65
     24 66.249.66.91
     38 66.249.66.95
     69 66.249.66.90
    148 66.249.66.94
  • Working on the nginx redirects for CGIAR Library
  • We should start using 301 redirects and also allow for /sitemap to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way
  • Remove eleven occurrences of ACP in IITA’s cg.coverage.region using the Atmire batch edit module from Discovery
  • Need to investigate how we can verify the library.cgiar.org using the HTML or DNS methods
  • Run corrections on 143 ILRI Archive items that had two dc.identifier.uri values (Handle) that Peter had pointed out earlier this week
  • I used OpenRefine to isolate them and then fixed and re-imported them into CGSpace
  • I manually checked a dozen of them and it appeared that the correct handle was always the second one, so I just deleted the first one

2017-10-06

Original flat thumbnails Tweaked with border and box shadow

  • I’ll post it to the Yammer group to see what people think
  • I figured out at way to do the HTML verification for Google Search console for library.cgiar.org
  • We can drop the HTML file in their XMLUI theme folder and it will get copied to the webapps directory during build/install
  • Then we add an nginx alias for that URL in the library.cgiar.org vhost
  • This method is kinda a hack but at least we can put all the pieces into git to be reproducible
  • I will tell Tunji to send me the verification file

2017-10-10

  • Deploy logic to allow verification of the library.cgiar.org domain in the Google Search Console (#343)
  • After verifying both the HTTP and HTTPS domains and submitting a sitemap it will be interesting to see how the stats in the console as well as the search results change (currently 28,500 results):

Google Search Console Google Search Console 2 Google Search results

  • I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace’s console (currently I’m just a user) in order to do that
  • Manually clean up some communities and collections that Peter had requested a few weeks ago
  • Delete Community 10568/102 (ILRI Research and Development Issues)
  • Move five collections to 10568/27629 (ILRI Projects) using move-collections.sh with the following configuration:
10568/1637 10568/174 10568/27629
10568/1642 10568/174 10568/27629
10568/1614 10568/174 10568/27629
10568/75561 10568/150 10568/27629
10568/183 10568/230 10568/27629
  • Delete community 10568/174 (Sustainable livestock futures)
  • Delete collections in 10568/27629 that have zero items (33 of them!)

2017-10-11

  • Peter added me as an owner on the CGSpace property on Google Search Console and I tried to submit a “Change of Address” request for the CGIAR Library but got an error:

Change of Address error

  • We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won’t work—we’ll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects
  • Also the Google Search Console doesn’t work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!

2017-10-12

  • Finally finish (I think) working on the myriad nginx redirects for all the CGIAR Library browse stuff—it ended up getting pretty complicated!
  • I still need to commit the DSpace changes (add browse index, XMLUI strings, Discovery index, etc), but I should be able to deploy that on CGSpace soon

2017-10-14

  • Run system updates on DSpace Test and reboot server
  • Merge changes adding a search/browse index for CGIAR System subject to 5_x-prod (#344)
  • I checked the top browse links in Google’s search results for site:library.cgiar.org inurl:browse and they are all redirected appropriately by the nginx rewrites I worked on last week

2017-10-22

  • Run system updates on DSpace Test and reboot server
  • Re-deploy CGSpace from latest 5_x-prod (adds ISI Journal to search filters and adds Discovery index for CGIAR Library systemsubject)
  • Deploy nginx redirect fixes to catch CGIAR Library browse links (redirect to their community and translate subject→systemsubject)
  • Run migration of CGSpace server (linode18) for Linode security alert, which took 42 minutes of downtime

2017-10-26

  • In the last 24 hours we’ve gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace
  • Uptime Robot even noticed CGSpace go “down” for a few minutes
  • In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down due to SQL connection pool
  • Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up
  • Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again
  • Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
  • Compared to other days there were two or three times the number of requests yesterday!
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
  • I still have no idea what was causing the load to go up today
  • I finally investigated Magdalena’s issue with the item download stats and now I can’t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats
  • I think it might have been an issue with the statistics not being fresh
  • I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten
  • Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data
  • I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection
  • We’ve never used it but it could be worth looking at

2017-10-27

  • Linode alerted about high CPU usage again (twice) on CGSpace in the last 24 hours, around 2AM and 2PM

2017-10-28

  • Linode alerted about high CPU usage again on CGSpace around 2AM this morning

2017-10-29

  • Linode alerted about high CPU usage again on CGSpace around 2AM and 4AM
  • I’m still not sure why this started causing alerts so repeatadely the past week
  • I don’t see any tell tale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:
# grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2049
  • So there were 2049 unique sessions during the hour of 2AM
  • Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts
  • I think I’ll need to enable access logging in nginx to figure out what’s going on
  • After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:
137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
  • CORE seems to be some bot that is “Aggregating the world’s open access research papers”
  • The contact address listed in their bot’s user agent is incorrect, correct page is simply: https://core.ac.uk/contact
  • I will check the logs in a few days to see if they are harvesting us regularly, then add their bot’s user agent to the Tomcat Crawler Session Valve
  • After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
  • For now I will just contact them to have them update their contact info in the bot’s user agent, but eventually I think I’ll tell them to swap out the CGIAR Library entry for CGSpace

2017-10-30

  • Like clock work, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)
  • Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:
dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
  • Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:
# grep -c "CORE/0.6" /var/log/nginx/access.log 
26475
# grep -c "CORE/0.6" /var/log/nginx/access.log.1
135083
  • IP addresses for this bot currently seem to be:
# grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
137.108.70.6
137.108.70.7
  • I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:
# grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
  • … and most of their requests are for dynamic discover pages:
# grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
24055
  • Just because I’m curious who the top IPs are:
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    496 62.210.247.93
    571 46.4.94.226
    651 40.77.167.39
    763 157.55.39.231
    782 207.46.13.90
    998 66.249.66.90
   1948 104.196.152.243
   4247 190.19.92.5
  31602 137.108.70.6
  31636 137.108.70.7
  • At least we know the top two are CORE, but who are the others?
  • 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
  • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!
# grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2811
  • From looking at the requests, it appears these are from CIAT and CCAFS
  • I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them
  • Actually, according to the Tomcat docs, we could use an IP with crawlerIps: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
  • Ah, wait, it looks like crawlerIps only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!
  • That would explain the errors I was getting when trying to set it:
WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
  • As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
    410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
    574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
   1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
  • I will check again tomorrow

2017-10-31

  • Very nice, Linode alerted that CGSpace had high CPU usage at 2AM again
  • Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item
  • To follow up on the CORE bot traffic, there were almost 300,000 request yesterday:
# grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
 139109 137.108.70.6
 139253 137.108.70.7
  • I’ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace
  • Also, I asked if they could perhaps use the sitemap.xml, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets
  • I added GoAccess to the list of package to install in the DSpace role of the Ansible infrastructure scripts
  • It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:
# goaccess /var/log/nginx/access.log --log-format=COMBINED
  • According to Uptime Robot CGSpace went down and up a few times
  • I had a look at goaccess and I saw that CORE was actively indexing
  • Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)
  • I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
  • Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
# grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
 158058 GET /discover
  14260 GET /search-filter
  • I tested a URL of pattern /discover in Google’s webmaster tools and it was indeed identified as blocked
  • I will send feedback to the CORE bot team