diff --git a/content/posts/2024-06.md b/content/posts/2024-06.md
index 2ef0f33ae..c46668750 100644
--- a/content/posts/2024-06.md
+++ b/content/posts/2024-06.md
@@ -45,4 +45,63 @@ value.parseJson()['datasetVersion']['termsOfUse']
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
+## 2024-06-19
+
+- Spent some time checking the remaining 3,312 items in the IFPRI 2016–2019 migration set for duplicates on CGSpace
+ - There seem to be about 50 exact matches of title, type, and issue date
+
+## 2024-06-20
+
+- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
+- Heavy load on both CGSpace and DSpace 7 Test this afternoon
+ - Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
+ - The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
+
+```
+0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
+1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
+3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
+```
+
+- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
+ - I changed the nginx logging to use the `X-Forwarded-For` header, because the default `combined` log format uses `$remote_addr`, which is only accurate if the request doesn't come from Angular (i.e. it goes directly to the API)
+ - From what I can see now the IPs are all coming from Huawei Cloud and Tencent
+ - The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
+ - For now I will just add those to the list of bot networks
+
+## 2024-06-21
+
+- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
+ - I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
+ - I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
+- Then I updated the list of bot networks:
+
+```console
+$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
+ https://asn.ipinfo.app/api/text/list/AS132203 \
+ https://asn.ipinfo.app/api/text/list/AS13238 \
+ https://asn.ipinfo.app/api/text/list/AS136907 \
+ https://asn.ipinfo.app/api/text/list/AS14061 \
+ https://asn.ipinfo.app/api/text/list/AS14618 \
+ https://asn.ipinfo.app/api/text/list/AS16276 \
+ https://asn.ipinfo.app/api/text/list/AS16509 \
+ https://asn.ipinfo.app/api/text/list/AS203020 \
+ https://asn.ipinfo.app/api/text/list/AS204287 \
+ https://asn.ipinfo.app/api/text/list/AS21859 \
+ https://asn.ipinfo.app/api/text/list/AS23576 \
+ https://asn.ipinfo.app/api/text/list/AS24940 \
+ https://asn.ipinfo.app/api/text/list/AS396982 \
+ https://asn.ipinfo.app/api/text/list/AS45102 \
+ https://asn.ipinfo.app/api/text/list/AS50245 \
+ https://asn.ipinfo.app/api/text/list/AS55286 \
+ https://asn.ipinfo.app/api/text/list/AS6939 \
+ https://asn.ipinfo.app/api/text/list/AS8075
+$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
+$ wc -l /tmp/networks.txt
+8675 /tmp/networks.txt
+```
+
+- Update list of ORCID identifiers with new ones from Alliance and IFPRI
+- Finalize uploading the remaining 3,264 items from IFPRI's 2016–2019 batch migration to CGSpace
+
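As a rough illustration of the `/search` facet analysis described in the 2024-06-20 notes, here is a minimal Python sketch. It assumes log lines shaped like the `pm2 logs` output quoted there. The function name `facet_counts` and the shortened sample lines are hypothetical and not part of the original notes.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def facet_counts(log_lines):
    """Count how often each f.* facet parameter appears in /search requests."""
    counts = Counter()
    for line in log_lines:
        # pm2 prefixes each line with "<id>|<app> | "; the request follows "GET "
        _, sep, rest = line.partition("GET ")
        if not sep:
            continue
        fields = rest.split()
        if not fields:
            continue
        url = urlsplit(fields[0])
        if url.path != "/search":
            continue
        # parse_qsl keeps repeated keys, so two f.sdg filters count twice
        for key, _value in parse_qsl(url.query):
            if key.startswith("f."):
                counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical, shortened sample lines in the same shape as the pm2 output
    sample = [
        "0|dspace-ui | GET /search?f.sdg=SDG%2013,equals&spc.page=1&f.region=Eastern%20Africa,equals - - ms - -",
        "1|dspace-ui | GET /search?f.sdg=SDG%2012,equals&f.sdg=SDG%2013,equals - - ms - -",
        "2|dspace-ui | GET /home - - ms - -",
    ]
    for facet, n in facet_counts(sample).most_common():
        print(facet, n)
```

Counting facet names rather than full URLs makes it easier to see which filters a crawler is iterating over, since the combinations themselves rarely repeat.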
diff --git a/docs/2024-06/index.html b/docs/2024-06/index.html
index 12a072b9a..4829cb943 100644
--- a/docs/2024-06/index.html
+++ b/docs/2024-06/index.html
@@ -19,7 +19,7 @@ We have both Handles and DOIs for these datasets, both from Harvard’s Data
-
+
@@ -44,9 +44,9 @@ We have both Handles and DOIs for these datasets, both from Harvard’s Data
"@type": "BlogPosting",
"headline": "June, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-06/",
- "wordCount": "194",
+ "wordCount": "564",
"datePublished": "2024-06-03T14:14:00+03:00",
- "dateModified": "2024-06-16T16:40:54+03:00",
+ "dateModified": "2024-06-18T17:30:08+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -164,6 +164,73 @@ We have both Handles and DOIs for these datasets, both from Harvard’s Data
+
2024-06-19
+
+- Spent some time checking the remaining 3,312 items in the IFPRI 2016–2019 migration set for duplicates on CGSpace
+
+- There seem to be about 50 exact matches of title, type, and issue date
+
+
+
+2024-06-20
+
+- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
+- Heavy load on both CGSpace and DSpace 7 Test this afternoon
+
+- Took me a while to figure out it was due to someone / something hammering /search for a bunch of facets
+- The pm2 logs command was more useful than the nginx logs to see the requests at least, for example:
+
+
+
+0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
+1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
+3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
+
+- Still difficult to find the client, because the logs are all coming from Angular’s user agent and IP
+
+- I changed the nginx logging to use the X-Forwarded-For header, because the default combined log format uses $remote_addr, which is only accurate if the request doesn’t come from Angular (i.e. it goes directly to the API)
+- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
+- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
+- For now I will just add those to the list of bot networks
+
+
+
+2024-06-21
+
+- Update the nginx logging to use nginx’s real_ip module to log the correct client IP
+
+- I think this means we will start sending ‘bot’ to the Angular / Express frontend because bot IPs will be properly classified now…
+- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client’s user-agent
+
+
+- Then I updated the list of bot networks:
+
+$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
+ https://asn.ipinfo.app/api/text/list/AS132203 \
+ https://asn.ipinfo.app/api/text/list/AS13238 \
+ https://asn.ipinfo.app/api/text/list/AS136907 \
+ https://asn.ipinfo.app/api/text/list/AS14061 \
+ https://asn.ipinfo.app/api/text/list/AS14618 \
+ https://asn.ipinfo.app/api/text/list/AS16276 \
+ https://asn.ipinfo.app/api/text/list/AS16509 \
+ https://asn.ipinfo.app/api/text/list/AS203020 \
+ https://asn.ipinfo.app/api/text/list/AS204287 \
+ https://asn.ipinfo.app/api/text/list/AS21859 \
+ https://asn.ipinfo.app/api/text/list/AS23576 \
+ https://asn.ipinfo.app/api/text/list/AS24940 \
+ https://asn.ipinfo.app/api/text/list/AS396982 \
+ https://asn.ipinfo.app/api/text/list/AS45102 \
+ https://asn.ipinfo.app/api/text/list/AS50245 \
+ https://asn.ipinfo.app/api/text/list/AS55286 \
+ https://asn.ipinfo.app/api/text/list/AS6939 \
+ https://asn.ipinfo.app/api/text/list/AS8075
+$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
+$ wc -l /tmp/networks.txt
+8675 /tmp/networks.txt
+
+- Update list of ORCID identifiers with new ones from Alliance and IFPRI
+- Finalize uploading the remaining 3,264 items from IFPRI’s 2016–2019 batch migration to CGSpace
+
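The aggregation step in the 2024-06-21 notes (`mapcidr -a` over the downloaded ASN prefix lists) can be approximated with Python's standard `ipaddress` module. This is only a sketch of the idea, not what `mapcidr` does internally. The helper names `aggregate` and `in_networks` are hypothetical, and the sample prefixes are documentation ranges rather than real bot networks.

```python
import ipaddress


def aggregate(cidrs):
    """Collapse overlapping or adjacent CIDR blocks into a minimal covering list."""
    v4, v6 = [], []
    for line in cidrs:
        line = line.strip()
        if not line:
            continue
        net = ipaddress.ip_network(line, strict=False)
        (v4 if net.version == 4 else v6).append(net)
    # collapse_addresses() raises TypeError on mixed IP versions,
    # so IPv4 and IPv6 prefixes are collapsed separately
    return list(ipaddress.collapse_addresses(v4)) + list(
        ipaddress.collapse_addresses(v6)
    )


def in_networks(ip, networks):
    """Return True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(net.version == addr.version and addr in net for net in networks)


if __name__ == "__main__":
    # Documentation prefixes (RFC 5737 / RFC 3849), standing in for ASN lists
    prefixes = [
        "192.0.2.0/25",
        "192.0.2.128/25",
        "198.51.100.0/24",
        "198.51.100.16/28",
        "2001:db8::/33",
        "2001:db8:8000::/33",
    ]
    networks = aggregate(prefixes)
    for net in networks:
        print(net)
    print(in_networks("192.0.2.200", networks))
```

A linear scan over several thousand networks is fine for an occasional check. For per-request classification, nginx's own `geo` or `map` lookups would be the more natural place to do it.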
diff --git a/docs/categories/index.html b/docs/categories/index.html
index c7097bbaa..7b9b21be4 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/index.xml b/docs/categories/index.xml
index 7a2a260a3..fa8d0dbb2 100644
--- a/docs/categories/index.xml
+++ b/docs/categories/index.xml
@@ -6,7 +6,7 @@
Recent content in Categories on CGSpace Notes
Hugo
en-us
- Sun, 16 Jun 2024 16:40:54 +0300
+ Tue, 18 Jun 2024 17:30:08 +0300
-
Notes
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 1c3b04e27..ae7f82754 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml
index 08e8cb720..1a6832ba5 100644
--- a/docs/categories/notes/index.xml
+++ b/docs/categories/notes/index.xml
@@ -6,7 +6,7 @@
Recent content in Notes on CGSpace Notes
Hugo
en-us
- Sun, 16 Jun 2024 16:40:54 +0300
+ Tue, 18 Jun 2024 17:30:08 +0300
-
June, 2024
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index c875b090f..06b26ebdf 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 269bac67d..7b917bcfa 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 3756bb800..f7b4bee65 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index b9a5b4f10..2813123ab 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index d6cd96e3c..23c3fbb7e 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index 64d406866..19da2ec89 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/8/index.html b/docs/categories/notes/page/8/index.html
index 395f804ff..aa3446e43 100644
--- a/docs/categories/notes/page/8/index.html
+++ b/docs/categories/notes/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/9/index.html b/docs/categories/notes/page/9/index.html
index 065f4cd88..bbd770074 100644
--- a/docs/categories/notes/page/9/index.html
+++ b/docs/categories/notes/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index e2e0aacf3..66d886f33 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.xml b/docs/index.xml
index 29ac46808..c006b601e 100644
--- a/docs/index.xml
+++ b/docs/index.xml
@@ -6,7 +6,7 @@
Recent content on CGSpace Notes
Hugo
en-us
- Sun, 16 Jun 2024 16:40:54 +0300
+ Tue, 18 Jun 2024 17:30:08 +0300
-
June, 2024
diff --git a/docs/page/10/index.html b/docs/page/10/index.html
index 1f738613d..44ab72931 100644
--- a/docs/page/10/index.html
+++ b/docs/page/10/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/11/index.html b/docs/page/11/index.html
index 8c5b00d54..282d0682a 100644
--- a/docs/page/11/index.html
+++ b/docs/page/11/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index b94c963f2..ccb7e0172 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 73c617455..4c3a52874 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index bf6c0be91..b998c3fc4 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 3e4599d5d..1d1bb4e9c 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 00ab0fe38..11f5bfe0f 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index e8609a41f..89857bed5 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 6c718fbc4..2f70b2b21 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index b66358dfa..83949a2b2 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 8f3ec4820..0fa75cf8d 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.xml b/docs/posts/index.xml
index 32b922a69..68e399860 100644
--- a/docs/posts/index.xml
+++ b/docs/posts/index.xml
@@ -6,7 +6,7 @@
Recent content in Posts on CGSpace Notes
Hugo
en-us
- Sun, 16 Jun 2024 16:40:54 +0300
+ Tue, 18 Jun 2024 17:30:08 +0300
-
June, 2024
diff --git a/docs/posts/page/10/index.html b/docs/posts/page/10/index.html
index 93a83199e..b29305a6b 100644
--- a/docs/posts/page/10/index.html
+++ b/docs/posts/page/10/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/11/index.html b/docs/posts/page/11/index.html
index 0140a628d..6ebf81319 100644
--- a/docs/posts/page/11/index.html
+++ b/docs/posts/page/11/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 73999b310..1545eb702 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index d308ce4e4..d89aee3cb 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 07d6e4fa4..9cf3e0258 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 042512933..b43725343 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index cc2b12f0b..670270efe 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 2bedc8ece..8382b70e1 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 195081b37..65437297a 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index f00361e1b..7b3c9e2a8 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index b5e2481cd..88ce07613 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2024-06-16T16:40:54+03:00
+ 2024-06-18T17:30:08+03:00
https://alanorth.github.io/cgspace-notes/
- 2024-06-16T16:40:54+03:00
+ 2024-06-18T17:30:08+03:00
https://alanorth.github.io/cgspace-notes/2024-06/
- 2024-06-16T16:40:54+03:00
+ 2024-06-18T17:30:08+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2024-06-16T16:40:54+03:00
+ 2024-06-18T17:30:08+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2024-06-16T16:40:54+03:00
+ 2024-06-18T17:30:08+03:00
https://alanorth.github.io/cgspace-notes/2024-05/
2024-05-28T16:40:32+03:00