diff --git a/content/posts/2022-09.md b/content/posts/2022-09.md index 7eb99f624..90fc71c82 100644 --- a/content/posts/2022-09.md +++ b/content/posts/2022-09.md @@ -231,4 +231,10 @@ COMMIT - Meeting with Peter, Abenet, Indira, and Michael about CGSpace rollout plan for the Initiatives +## 2022-09-16 + +- Meeting with Marie-Angelique, Abenet, Margarita, and Sara about types for CG Core + - We are about halfway through the list of types now, with concrete actions for CG Core and CGSpace + - We will meet next in two weeks to hopefully finalize the list, then we can move on to definitions + diff --git a/docs/2022-09/index.html b/docs/2022-09/index.html new file mode 100644 index 000000000..07d0b10d7 --- /dev/null +++ b/docs/2022-09/index.html @@ -0,0 +1,471 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + September, 2022 | CGSpace Notes + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+
+ + + + +
+
+

CGSpace Notes

+

Documenting day-to-day work on the CGSpace repository.

+
+
+ + + + +
+
+
+ + + + +
+
+

September, 2022

+ +
+

2022-09-01

+
    +
  • A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
  • +
  • I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change +
      +
    • The submission works as expected
    • +
    +
  • +
  • Start debugging some region-related issues with csv-metadata-quality +
      +
    • I created a new test file test-geography.csv with some different scenarios
    • +
    • I also fixed a few bugs and improved the region-matching logic
    • +
    +
  • +
+ +

2022-09-02

+
    +
  • I worked a bit more on exclusion and skipping logic in csv-metadata-quality +
      +
    • I also pruned and updated all the Python dependencies
    • +
    • Then I released version 0.6.0 now that the excludes and region matching support is working way better
    • +
    +
  • +
+

2022-09-05

+
    +
  • Started a harvest on AReS last night
  • +
  • Looking over the Solr statistics from last month I see many user agents that look suspicious: +
      +
    • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)
    • +
    • Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36
    • +
    • Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
    • +
    • Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
    • +
    • Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131
    • +
    • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
    • +
    • Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)
    • +
    • curb
    • +
    • bitdiscovery
    • +
    • omgili/0.5 +http://omgili.com
    • +
    • Mozilla/5.0 (compatible)
    • +
    • Vizzit
    • +
    • Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
    • +
    • Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0
    • +
    • Java/17-ea
    • +
    • AdobeUxTechC4-Async/3.0.12 (win32)
    • +
    • ZaloPC-win32-24v473
    • +
    • Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
    • +
    • Scoop.it
    • +
    • Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0
    • +
    • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
    • +
    • ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0
    • +
    • WebAPIClient
    • +
    • Mozilla/5.0 Firefox/26.0
    • +
    • Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)
    • +
    +
  • +
  • For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (Mozilla / 5.0)
  • +
  • Tons of hosts making requests like this:
  • +
+
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
+
    +
  • I got a list of hosts making requests like that so I can purge their hits:
  • +
+
# zcat /var/log/nginx/{access,library-access,oai,rest}.log.[123]*.gz | grep 'String.fromCharCode(' | awk '{print $1}' | sort -u > /tmp/ips.txt 
+
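The pipeline above can be sketched in Python as well, which is handy if the list needs further processing before purging. This is a minimal sketch, assuming each log line starts with the client IP (as `awk '{print $1}'` implies); the function name and the sample lines are hypothetical and only for illustration.

```python
def extract_ips(lines, needle="String.fromCharCode("):
    """Return the sorted, de-duplicated first field of every line containing needle.

    Mirrors: grep 'String.fromCharCode(' | awk '{print $1}' | sort -u
    """
    ips = set()
    for line in lines:
        if needle in line:
            fields = line.split()
            if fields:
                # First whitespace-separated field, like awk's $1
                ips.add(fields[0])
    return sorted(ips)


# Hypothetical sample lines (documentation IP ranges, not real hosts)
sample = [
    '192.0.2.10 - - [05/Sep/2022:10:00:00 +0200] "GET /x?a=<script>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5',
    '192.0.2.10 - - [05/Sep/2022:10:00:01 +0200] "GET /y?a=<script>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5',
    '198.51.100.7 - - [05/Sep/2022:10:00:02 +0200] "GET /handle/10568/1 HTTP/1.1" 200 1234',
    '203.0.113.5 - - [05/Sep/2022:10:00:03 +0200] "GET /z?a=<script>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5',
]

print(extract_ips(sample))
```

For the real logs the lines could be read with `gzip.open(path, "rt")` instead of the in-memory sample.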
    +
  • I purged 4,718 hits from those IPs
  • +
  • I see some new Hetzner ranges that I hadn’t blocked yet apparently? + +
  • +
+
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
+36
+$ sort -u /tmp/hetzner-combined.txt  | wc -l
+49
+
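A rough Python sketch of combining the already-blocked ranges with the new ones follows. It assumes the files hold one CIDR network per line, which the notes do not state, and it goes a step further than `sort -u` by collapsing overlapping or adjacent networks with the standard `ipaddress` module. The function name and sample ranges are hypothetical.

```python
import ipaddress


def combine_networks(*network_lists):
    """De-duplicate CIDR strings and collapse overlapping or adjacent networks."""
    nets = {
        ipaddress.ip_network(entry.strip(), strict=False)
        for network_list in network_lists
        for entry in network_list
        if entry.strip()
    }
    # collapse_addresses() cannot mix IPv4 and IPv6, so handle them separately
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    collapsed = list(ipaddress.collapse_addresses(v4)) + list(
        ipaddress.collapse_addresses(v6)
    )
    return [str(n) for n in collapsed]


# Hypothetical ranges from the documentation address space
already_blocked = ["192.0.2.0/25", "198.51.100.0/24"]
newly_found = ["192.0.2.128/25", "198.51.100.0/24", "2001:db8::/32"]

print(combine_networks(already_blocked, newly_found))
```

Collapsing keeps the resulting list shorter, but a plain de-duplicated list as produced by `sort -u` would work just as well for an nginx include.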
    +
  • I will add this new list to nginx’s bot-networks.conf so they get throttled on scraping XMLUI and get classified as bots in Solr statistics
  • +
  • Then I purged hits from the following user agents:
  • +
+
$ ./ilri/check-spider-hits.sh -f /tmp/agents
+Found 374 hits from curb in statistics
+Found 350 hits from bitdiscovery in statistics
+Found 564 hits from omgili in statistics
+Found 390 hits from Vizzit in statistics
+Found 9125 hits from AdobeUxTechC4-Async in statistics
+Found 97 hits from ZaloPC-win32-24v473 in statistics
+Found 518 hits from nbertaupete95 in statistics
+Found 218 hits from Scoop.it in statistics
+Found 584 hits from WebAPIClient in statistics
+
+Total number of hits from bots: 12220
+
    +
  • Then I will add these user agents to the ILRI spider override in DSpace
  • +
+

2022-09-06

+
    +
  • I’m testing dspace-statistics-api with our DSpace 7 test server +
      +
    • After setting up the env and the database the python -m dspace_statistics_api.indexer runs without issues
    • +
    • While playing with Solr I tried to search for statistics from this month using time:2022-09* but I get this error: “Can’t run prefix queries on numeric fields”
    • +
    • I guess that the syntax in Solr changed since 4.10…
    • +
    • This works, but is super annoying: time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]
    • +
    +
  • +
+
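To make the explicit range less annoying to type, the query string can be generated. This is a minimal sketch using only the standard library; the helper name is hypothetical and the field name `time` is taken from the query above.

```python
import calendar


def solr_month_range(year, month, field="time"):
    """Build a Solr range query covering one whole calendar month (UTC)."""
    # monthrange() returns (weekday of the first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    end = f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"


print(solr_month_range(2022, 9))
```

Using `calendar.monthrange` avoids hard-coding the last day of the month, including for February in leap years.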

2022-09-07

+
    +
  • I tested the controlled-vocabulary changes on DSpace 6 and they work fine +
      +
    • Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values
    • +
    • This is a pain because it means I have to re-do the IDs in each file every time I update them
    • +
    • If I add id="0000" to each, then I can use this vim expression let i=0001 | g/0000/s//\=i/ | let i=i+1 to replace the numbers with increments starting from 1
    • +
    +
  • +
  • Meeting with Marie Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week +
      +
    • We made progress with concrete actions and will continue next week
    • +
    +
  • +
+
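The controlled-vocabulary ID renumbering described above (the `id="0000"` placeholder plus the vim expression) could also be scripted. This is a sketch under the assumption that every node carries the literal placeholder `id="0000"`; the function name and the sample labels are hypothetical. Unlike the vim command, which substitutes once per matching line, this replaces every occurrence in the text.

```python
import re


def renumber_ids(text, placeholder='id="0000"', start=1):
    """Replace each placeholder with id="N", counting up from start."""
    counter = start - 1

    def replacement(_match):
        nonlocal counter
        counter += 1
        return f'id="{counter}"'

    return re.sub(re.escape(placeholder), replacement, text)


# Hypothetical vocabulary fragment
vocabulary = (
    '<node id="0000" label="First value"/>\n'
    '<node id="0000" label="Second value"/>\n'
    '<node id="0000" label="Third value"/>\n'
)

print(renumber_ids(vocabulary))
```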

2022-09-08

+
    +
  • I had a meeting with Nicky from UNEP to discuss issues they are having with their DSpace +
      +
    • I told her about the meeting of DSpace community people that we’re planning at ILRI in the next few weeks
    • +
    +
  • +
+

2022-09-09

+
    +
  • Add some value mappings to AReS because I see a lot of incorrect regions and countries
  • +
  • I also found some values that were blank in CGSpace so I deleted them:
  • +
+
dspace=# BEGIN;
+BEGIN
+dspace=# DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
+DELETE 70
+dspace=# COMMIT;
+COMMIT
+
    +
  • Start a full Discovery reindex on CGSpace so that these changes show up in Discovery
  • +
+

2022-09-11

+
    +
  • Today is Sunday and I see the load on the server is high +
      +
    • Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it’s not from them!
    • +
    • Looking at the top IPs this morning:
    • +
    +
  • +
+
# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
+...
+    165 64.233.172.79
+    166 87.250.224.34
+    200 69.162.124.231
+    202 216.244.66.198
+    385 207.46.13.149
+    398 207.46.13.147
+    421 66.249.64.185
+    422 157.55.39.81
+    442 2a01:4f8:1c17:5550::1
+    451 64.124.8.36
+    578 137.184.159.211
+    597 136.243.228.195
+   1185 66.249.64.183
+   1201 157.55.39.80
+   3135 80.248.237.167
+   4794 54.195.118.125
+   5486 45.5.186.2
+   6322 2a01:7e00::f03c:91ff:fe9a:3a37
+   9556 66.249.64.181
+
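The same per-IP tally can be sketched in Python with `collections.Counter`. This assumes the client IP is the first field of each log line, as in the shell pipeline; the function name and sample lines are hypothetical. Note that `most_common()` lists the busiest clients first, whereas the `sort -h | tail` output above lists them last.

```python
from collections import Counter


def top_clients(lines, date="11/Sep/2022", n=40):
    """Count requests per client IP on the given date and return the top n."""
    counts = Counter(line.split()[0] for line in lines if date in line)
    return counts.most_common(n)


# Hypothetical sample lines (documentation IP ranges, not real hosts)
sample_log = [
    '192.0.2.1 - - [11/Sep/2022:08:00:00 +0200] "GET / HTTP/1.1" 200 10',
    '192.0.2.1 - - [11/Sep/2022:08:00:01 +0200] "GET / HTTP/1.1" 200 10',
    '192.0.2.1 - - [11/Sep/2022:08:00:02 +0200] "GET / HTTP/1.1" 200 10',
    '192.0.2.2 - - [11/Sep/2022:08:00:03 +0200] "GET / HTTP/1.1" 503 0',
    '192.0.2.2 - - [11/Sep/2022:08:00:04 +0200] "GET / HTTP/1.1" 503 0',
    '192.0.2.3 - - [11/Sep/2022:08:00:05 +0200] "GET / HTTP/1.1" 200 10',
    '192.0.2.3 - - [10/Sep/2022:23:59:59 +0200] "GET / HTTP/1.1" 200 10',
]

for ip, hits in top_clients(sample_log, n=2):
    print(f"{hits:7d} {ip}")
```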
    +
  • The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least
  • +
  • Then there’s 80.248.237.167, which is using a normal user agent and scraping Discovery +
      +
    • That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as ‘bot’ for XMLUI so most of these requests are HTTP 503
    • +
    +
  • +
  • On another note, I’m curious to explore enabling caching of certain REST API responses +
      +
    • For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:
    • +
    +
  • +
+
# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
+      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
+      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
+      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
+      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
+      5 /rest/handle/10568/110310?expand=all
+      5 /rest/handle/10568/89980?expand=all
+      5 /rest/handle/10568/97614?expand=all
+      6 /rest/handle/10568/107086?expand=all
+      6 /rest/handle/10568/108503?expand=all
+      6 /rest/handle/10568/98424?expand=all
+
    +
  • I specifically have to not cache things like requests for bitstreams because those are from actual users and we need to keep the real requests so we get the statistics hit +
      +
    • Will be interesting to check the results above as the day goes on (now 10AM)
    • +
    • To estimate the potential savings from caching I will check how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday’s log):
    • +
    +
  • +
+
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
+33733
+# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
+5637
+
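With the two numbers above, 5,637 of the 33,733 distinct non-bitstream request paths were requested more than once, which is roughly 17%. The sketch below repeats that arithmetic and shows how the two counts could be computed in a single pass. It assumes the request path is already extracted from each log line (field 7 in the `awk` command); the function name and sample paths are hypothetical.

```python
from collections import Counter


def repeat_stats(paths, exclude="retrieve"):
    """Return (distinct paths, paths seen more than once, share repeated)."""
    counts = Counter(p for p in paths if exclude not in p)
    distinct = len(counts)
    repeated = sum(1 for c in counts.values() if c > 1)
    share = repeated / distinct if distinct else 0.0
    return distinct, repeated, share


# Share of distinct paths requested more than once, from the counts above
share_from_notes = 5637 / 33733
print(f"{share_from_notes:.1%}")

# Hypothetical sample paths
sample_paths = [
    "/rest/items/a/metadata",
    "/rest/items/a/metadata",
    "/rest/handle/10568/1?expand=all",
    "/rest/bitstreams/1/retrieve",
    "/rest/bitstreams/1/retrieve",
]
print(repeat_stats(sample_paths))
```

This only counts distinct paths that repeat, not how many individual requests a cache would have absorbed, so it is a rough indicator rather than a hit-rate estimate.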
    +
  • In the afternoon I started a harvest on AReS (which should affect the numbers above also)
  • +
  • I enabled an nginx proxy cache on DSpace Test for this location regex: location ~ /rest/(handle|items|collections|communities)/.+
  • +
+

2022-09-12

+
    +
  • I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled +
      +
    • I had to tune the regular expression in nginx a bit because the REST requests OpenRXV uses weren’t matching
    • +
    • Now I’m trying this one: /rest/(handle|items|collections|communities)/?
    • +
    • Testing in regex101.com with this test string:
    • +
    +
  • +
+
/rest/handle/10568/27611
+/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=36270
+/rest/handle/10568/110310?expand=all
+/rest/rest/bitstreams/28926633-c7c2-49c2-afa8-6d81cadc2316/retrieve
+/rest/bitstreams/15412/retrieve
+/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/metadata
+/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/bitstreams
+/rest/collections/edea23c0-0ebd-4525-90b0-0b401f997704/items
+/rest/items/14507941-aff2-4d57-90bd-03a0733ad859/metadata
+/rest/communities/b38ea726-475f-4247-a961-0d0b76e67f85/collections
+/rest/collections/e994c450-6ff7-41c6-98df-51e5c424049e/items?limit=10000
+
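The test strings above can also be checked against the pattern outside of regex101. This is a sketch using Python's `re` module as a stand-in for nginx's PCRE matching. For a pattern this simple the two should behave the same, but that is an assumption, and nginx matches `location` regexes against the URI without the query string. The helper name is hypothetical.

```python
import re

# The nginx location regex being tested (unanchored, like `location ~`)
PATTERN = re.compile(r"/rest/(handle|items|collections|communities)/?")

TEST_STRINGS = [
    "/rest/handle/10568/27611",
    "/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=36270",
    "/rest/handle/10568/110310?expand=all",
    "/rest/rest/bitstreams/28926633-c7c2-49c2-afa8-6d81cadc2316/retrieve",
    "/rest/bitstreams/15412/retrieve",
    "/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/metadata",
    "/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/bitstreams",
    "/rest/collections/edea23c0-0ebd-4525-90b0-0b401f997704/items",
    "/rest/items/14507941-aff2-4d57-90bd-03a0733ad859/metadata",
    "/rest/communities/b38ea726-475f-4247-a961-0d0b76e67f85/collections",
    "/rest/collections/e994c450-6ff7-41c6-98df-51e5c424049e/items?limit=10000",
]


def unmatched_positions(pattern, strings):
    """Return the 1-based positions of strings the pattern does not match."""
    return [i for i, s in enumerate(strings, start=1) if not pattern.search(s)]


print(unmatched_positions(PATTERN, TEST_STRINGS))
```

Only the two bitstream retrieval requests fall outside the pattern, so they would bypass the cache and still register as statistics hits.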
    +
  • I estimate that it will take about 1GB of cache to harvest 100,000 items from CGSpace with OpenRXV (10,000 pages)
  • +
  • Basically all of the test strings except the fourth and fifth (the bitstream requests) should match
  • +
  • Upload 682 OICRs from MARLO to CGSpace +
      +
    • We had tested these on DSpace Test last month along with the MELIAs, Policies, and Innovations, but we decided to upload the OICRs first so that other things can link against them as related items
    • +
    +
  • +
+

2022-09-14

+
    +
  • Meeting with Peter, Abenet, Indira, and Michael about CGSpace rollout plan for the Initiatives
  • +
+

2022-09-16

+
    +
  • Meeting with Marie-Angelique, Abenet, Margarita, and Sara about types for CG Core +
      +
    • We are about halfway through the list of types now, with concrete actions for CG Core and CGSpace
    • +
    • We will meet next in two weeks to hopefully finalize the list, then we can move on to definitions
    • +
    +
  • +
+ + + + + + +
+ + + +
+ + + + +
+
+ + + + + + + + + diff --git a/docs/categories/index.html b/docs/categories/index.html index 219ccac82..352924d69 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index c4fa9a048..172dfa64c 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 53406bae1..4346aba43 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 91f1b96e0..8a89acdd6 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 33364f2d1..545ecab71 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index b377a1870..db6cc7221 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index 79e5416d1..1064e5b21 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index 46cdc1749..56ce39960 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index b2a04b3cc..03cbeaec6 100644 --- a/docs/index.html +++ 
b/docs/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index ed9d4400e..a3ebe7570 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index b2393d64e..4111c4a94 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 1f52657bd..bf2037f77 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 834197cac..cb54f3466 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index a0ed8f38b..d4329e09e 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index ae5a3c824..818b09963 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index 5ae06920a..6becd2178 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index 28ca92a50..f8eed51df 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index cecd2bca4..8a68d8098 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 937cc42a1..fd563063f 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 81ef74520..14291aa92 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 
+10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 723659e06..c50f5deab 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 8810e320e..b0e37b53f 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 3c0d3ff62..416b22fd9 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index bec3fbffd..6f02da0ce 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index e1bf5e9d8..8cab35029 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index 6da6b60b8..d67968931 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index cdb1d2a36..92b090d05 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2022-09-15T08:37:36+03:00 + 2022-09-15T08:37:57+03:00 https://alanorth.github.io/cgspace-notes/ - 2022-09-15T08:37:36+03:00 + 2022-09-15T08:37:57+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2022-09-15T08:37:36+03:00 + 2022-09-15T08:37:57+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2022-09-15T08:37:36+03:00 + 2022-09-15T08:37:57+03:00 https://alanorth.github.io/cgspace-notes/2022-09/ - 2022-09-15T08:37:36+03:00 + 2022-09-15T08:37:57+03:00 https://alanorth.github.io/cgspace-notes/2022-08/ 
2022-08-31T17:37:28+03:00