CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2018

2018-11-01

  • Finalize AReS Phase I and Phase II ToRs
  • Send a note about my dspace-statistics-api to the dspace-tech mailing list

2018-11-03

  • Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
  • Today these are the top 10 IPs:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1300 66.249.64.63
   1384 35.237.175.180
   1430 138.201.52.218
   1455 207.46.13.156
   1500 40.77.167.175
   1979 50.116.102.77
   2790 66.249.64.61
   3367 84.38.130.177
   4537 70.32.83.92
  22508 66.249.64.59
  • The 66.249.64.x are definitely Google
  • 70.32.83.92 is well known, probably CCAFS or something, as it’s only a few thousand requests and always to REST API
  • 84.38.130.177 is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
  • They at least seem to be re-using their Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03 | sort | uniq
342
  • 50.116.102.77 is also a regular REST API user
  • 40.77.167.175 and 207.46.13.156 seem to be Bing
  • 138.201.52.218 seems to be on Hetzner in Germany, but is using this user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
  • And it doesn’t seem they are re-using their Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03 | sort | uniq
1243
  • Ah, we’ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
  • I wonder if it’s worth adding them to the list of bots in the nginx config?
  • Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth
  • Looking at the nginx logs again I see the following top ten IPs:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1979 50.116.102.77
   1980 35.237.175.180
   2186 207.46.13.156
   2208 40.77.167.175
   2843 66.249.64.63
   4220 84.38.130.177
   4537 70.32.83.92
   5593 66.249.64.61
  12557 78.46.89.18
  32152 66.249.64.59
  • 78.46.89.18 is new since I last checked a few hours ago, and it’s from Hetzner with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
  • It’s making lots of requests and using quite a number of Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' /home/cgspace.cgiar.org/log/dspace.log.2018-11-03 | sort | uniq
8449
  • I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing
  • Perhaps I should think about adding rate limits to dynamic pages like /discover and /browse
  • I think it’s reasonable for a human to click one of those links five or ten times a minute…
  • To contrast, 78.46.89.18 made about 300 requests per minute for a few hours today:
# grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
    286 03/Nov/2018:18:02
    287 03/Nov/2018:18:21
    289 03/Nov/2018:18:23
    291 03/Nov/2018:18:27
    293 03/Nov/2018:18:34
    300 03/Nov/2018:17:58
    300 03/Nov/2018:18:22
    300 03/Nov/2018:18:32
    304 03/Nov/2018:18:12
    305 03/Nov/2018:18:13
    305 03/Nov/2018:18:24
    312 03/Nov/2018:18:39
    322 03/Nov/2018:18:17
    326 03/Nov/2018:18:38
    327 03/Nov/2018:18:16
    330 03/Nov/2018:17:57
    332 03/Nov/2018:18:19
    336 03/Nov/2018:17:56
    340 03/Nov/2018:18:14
    341 03/Nov/2018:18:18
  • If they want to download all our metadata and PDFs they should use an API rather than scraping the XMLUI
  • I will add them to the list of bot IPs in nginx for now and think about enforcing rate limits in XMLUI later
  • Also, this is the third (?) time a mysterious IP on Hetzner has done this… who is this?

2018-11-04

  • Forward Peter’s information about CGSpace financials to Modi from ICRISAT
  • Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again
  • Here are the top ten IPs active so far this morning:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1083 2a03:2880:11ff:2::face:b00c
   1105 2a03:2880:11ff:d::face:b00c
   1111 2a03:2880:11ff:f::face:b00c
   1134 84.38.130.177
   1893 50.116.102.77
   2040 66.249.64.63
   4210 66.249.64.61
   4534 70.32.83.92
  13036 78.46.89.18
  20407 66.249.64.59
  • 78.46.89.18 is back… and still making tons of Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq
8765
  • Also, now we have a ton of Facebook crawlers:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    905 2a03:2880:11ff:b::face:b00c
    955 2a03:2880:11ff:5::face:b00c
    965 2a03:2880:11ff:e::face:b00c
    984 2a03:2880:11ff:8::face:b00c
    993 2a03:2880:11ff:3::face:b00c
    994 2a03:2880:11ff:7::face:b00c
   1006 2a03:2880:11ff:10::face:b00c
   1011 2a03:2880:11ff:4::face:b00c
   1023 2a03:2880:11ff:6::face:b00c
   1026 2a03:2880:11ff:9::face:b00c
   1039 2a03:2880:11ff:1::face:b00c
   1043 2a03:2880:11ff:c::face:b00c
   1070 2a03:2880:11ff::face:b00c
   1075 2a03:2880:11ff:a::face:b00c
   1093 2a03:2880:11ff:2::face:b00c
   1107 2a03:2880:11ff:d::face:b00c
   1116 2a03:2880:11ff:f::face:b00c
  • They are really making shit tons of Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
14368
  • Their user agent is:
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  • I will add it to the Tomcat Crawler Session Manager valve
  • Later in the evening… ok, this Facebook bot is getting super annoying:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
   1871 2a03:2880:11ff:3::face:b00c
   1885 2a03:2880:11ff:b::face:b00c
   1941 2a03:2880:11ff:8::face:b00c
   1942 2a03:2880:11ff:e::face:b00c
   1987 2a03:2880:11ff:1::face:b00c
   2023 2a03:2880:11ff:2::face:b00c
   2027 2a03:2880:11ff:4::face:b00c
   2032 2a03:2880:11ff:9::face:b00c
   2034 2a03:2880:11ff:10::face:b00c
   2050 2a03:2880:11ff:5::face:b00c
   2061 2a03:2880:11ff:c::face:b00c
   2076 2a03:2880:11ff:6::face:b00c
   2093 2a03:2880:11ff:7::face:b00c
   2107 2a03:2880:11ff::face:b00c
   2118 2a03:2880:11ff:d::face:b00c
   2164 2a03:2880:11ff:a::face:b00c
   2178 2a03:2880:11ff:f::face:b00c
  • And still making shit tons of Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
28470
  • And that’s even using the Tomcat Crawler Session Manager valve!
  • Maybe we need to limit more dynamic pages, like the “most popular” country, item, and author pages
  • It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!
# grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
7033
  • I added the “most-popular” pages to the list that return X-Robots-Tag: none to try to inform bots not to index or follow those pages
  • Also, I implemented an nginx rate limit of twelve requests per minute on all dynamic pages… I figure a human user might legitimately request one every five seconds

2018-11-05

  • I wrote a small Python script add-dc-rights.py to add usage rights (dc.rights) to CGSpace items based on the CSV Hector gave me from MARLO:
$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
  • The file marlo.csv was cleaned up and formatted in Open Refine
  • 165 of the items in their 2017 data are from CGSpace!
  • I will add the data to CGSpace this week (done!)
  • Jesus, is Facebook trying to be annoying?
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
29889
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq
29156
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
29896
  • 29,000 requests from Facebook, 29,000 Tomcat sessions, and none of the requests are to the dynamic pages I rate limited yesterday!

2018-11-06

  • I updated all the DSpace helper Python scripts to validate against PEP 8 using Flake8
  • While I was updating the rest-find-collections.py script I noticed it was using expand=all to get the collection and community IDs
  • I realized I actually only need expand=collections,subCommunities, and I wanted to see how much overhead the extra expands created so I did three runs of each:
$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
  • Average time with all expands was 14.3 seconds, and 12.8 seconds with collections,subCommunities, so 1.5 seconds difference!

2018-11-07

  • Update my dspace-statistics-api to use a database management class with Python contexts so that connections and cursors are automatically opened and closed
  • Tag version 0.7.0 of the dspace-statistics-api

2018-11-08

2018-11-11

  • I added tests to the dspace-statistics-api!
  • It runs with Python 3.5, 3.6, and 3.7 using pytest, including automatically on Travis CI!

2018-11-13

  • Help troubleshoot an issue with Judy Kimani submitting to the ILRI project reports, papers and documents collection on CGSpace
  • For some reason there is an existing group for the “Accept/Reject” workflow step, but it’s empty
  • I added Judy to the group and told her to try again
  • Sisay changed his leave to be full days until December so I need to finish the IITA records that he was working on (IITA_ ALIZZY1802-csv_oct23
  • Sisay had said there were a few PDFs missing and Bosede sent them this week, so I had to find those items on DSpace Test and add the bitstreams to the items manually
  • As for the collection mappings I think I need to export the CSV from DSpace Test, add mappings for each type (ie Books go to IITA books collection, etc), then re-import to DSpace Test, then export from DSpace command line in “migrate” mode…
  • From there I should be able to script the removal of the old DSpace Test collection so they just go to the correct IITA collections on import into CGSpace

2018-11-14

  • Finally import the 277 IITA (ALIZZY1802) records to CGSpace
  • I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace’s dspace export command doesn’t include the collections for the items!
  • Delete all old IITA collections on DSpace Test and run dspace cleanup to get rid of all the bitstreams

2018-11-15

2018-11-18

  • Request invoice from Wild Jordan for their meeting venue in January

2018-11-19

  • Testing corrections and deletions for AGROVOC (dc.subject) that Sisay and Peter were working on earlier this month:
$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
  • Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
  • Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;

2018-11-20

  • The Discovery re-indexing on CGSpace never finished yesterday… the command died after six minutes
  • The dspace.log.2018-11-19 shows this at the time:
2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
java.lang.IllegalStateException: DSpace kernel cannot be null
        at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
        at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
        at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:102)
        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:815)
        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:884)
        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
        at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2018-11-19 15:23:04,223 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
  • I looked in the Solr log around that time and I don’t see anything…
  • Working on Udana’s WLE records from last month, first the sixteen records in 2018-11-20 RDL Temp
    • these items will go to the Restoring Degraded Landscapes collection
    • a few items missing DOIs, but they are easily available on the publication page
    • clean up DOIs to use “https://doi.org" format
    • clean up some cg.identifier.url to remove unneccessary query strings
    • remove columns with no metadata (river basin, place, target audience, isbn, uri, publisher, ispartofseries, subject)
    • fix column with invalid spaces in metadata field name (cg. subject. wle)
    • trim and collapse whitespace in all fields
    • remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: value.replace('�','')
    • add dc.rights to some fields that I noticed while checking DOIs
  • Then the 24 records in 2018-11-20 VRC Temp
    • these items will go to the Variability, Risks and Competing Uses collection
    • trim and collapse whitespace in all fields (lots in WLE subject!)
    • clean up some cg.identifier.url fields that had unneccessary anchors in their links
    • clean up DOIs to use “https://doi.org" format
    • fix column with invalid spaces in metadata field name (cg. subject. wle)
    • remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)
    • remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: value.replace('�','')
    • I notice a few items using DOIs pointing at ICARDA’s DSpace like: https://doi.org/20.500.11766/8178, which then points at the “real” DOI on the publisher’s site… these should be using the real DOI instead of ICARDA’s “fake” Handle DOI
    • Some items missing DOIs, but they clearly have them if you look at the publisher’s site