diff --git a/content/posts/2022-07.md b/content/posts/2022-07.md
index 534fa0466..f450c16e4 100644
--- a/content/posts/2022-07.md
+++ b/content/posts/2022-07.md
@@ -168,3 +168,114 @@ UPDATE 108
   - cg.identifier.ccafsprojectpii
   - cg.subject.cifor
 - For now we will keep them in the search filters
+- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
+  - I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
+  - I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually
+- Deploy latest changes to submission form, Discovery, and browse on CGSpace
+  - Also run all system updates and reboot the host
+- Fix 152 `dcterms.relation` that are using "cgspace.cgiar.org" links instead of handles:
+
+```console
+UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
+```
+
+## 2022-07-10
+
+- UptimeRobot says that CGSpace is down
+  - I see high load around 22, high CPU around 800%
+  - Doesn't seem to be a lot of unique IPs:
+
+```console
+# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
+2243
+```
+
+Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:
+
+```console
+$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1613
+$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1696
+$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1708
+$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1830
+$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1811
+```
+
+![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week.png)
+
+- These IPs are using normal-looking user agents:
+  - `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1`
+  - `Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0"`
+  - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1`
+  - `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36`
+
+- I will add networks I'm seeing now to nginx's bot-networks.conf for now (not all of Hetzner) and purge the hits later:
+  - 65.108.0.0/16
+  - 65.21.0.0/16
+  - 95.216.0.0/16
+  - 135.181.0.0/16
+  - 138.201.0.0/16
+
+- I think I'm going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
+- Sheesh, there are a bunch more IPv6 addresses also on Hetzner:
+
+```console
+# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
+      1 2a01:4f9:6a:1c2b::2
+      2 2a01:4f9:2b:5a8::2
+      2 2a01:4f9:4b:4495::2
+     96 2a01:4f9:c010:518c::1
+    137 2a01:4f9:c010:a9bc::1
+    142 2a01:4f9:c010:58c9::1
+    142 2a01:4f9:c010:58ea::1
+    144 2a01:4f9:c010:58eb::1
+    145 2a01:4f9:c010:6ff8::1
+    148 2a01:4f9:c010:5190::1
+    149 2a01:4f9:c010:7d6d::1
+    153 2a01:4f9:c010:5226::1
+    156 2a01:4f9:c010:7f74::1
+    160 2a01:4f9:c010:5188::1
+    161 2a01:4f9:c010:58e5::1
+    168 2a01:4f9:c010:58ed::1
+    170 2a01:4f9:c010:548e::1
+    170 2a01:4f9:c010:8c97::1
+    175 2a01:4f9:c010:58c8::1
+    175 2a01:4f9:c010:aada::1
+    182 2a01:4f9:c010:58ec::1
+    182 2a01:4f9:c010:ae8c::1
+    502 2a01:4f9:c010:ee57::1
+    530 2a01:4f9:c011:567a::1
+    535 2a01:4f9:c010:d04e::1
+    539 2a01:4f9:c010:3d9a::1
+    586 2a01:4f9:c010:93db::1
+    593 2a01:4f9:c010:a04a::1
+    601 2a01:4f9:c011:4166::1
+    607 2a01:4f9:c010:9881::1
+    640 2a01:4f9:c010:87fb::1
+    648 2a01:4f9:c010:e680::1
+   1141 2a01:4f9:3a:2696::2
+   1146 2a01:4f9:3a:2555::2
+   3207 2a01:4f9:3a:2c19::2
+```
+
+- Maybe it's time I ban all of Hetzner... sheesh.
+- I left for a few hours and the server was going up and down the whole time, still very high CPU and database when I got back
+
+![CPU day](/cgspace-notes/2022/07/cpu-day.png)
+
+- I am not sure what's going on
+  - I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them and extract all the Hetzner networks and block them
+  - It's 181 IPs on Hetzner...
+- I rebooted the server to see if it was just some stuck locks in PostgreSQL...
+- The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? Two more subnets to block
+- Start a harvest on AReS
+
+## 2022-07-12
+
+- Update an incorrect ORCID identifier for Alliance
+
+
diff --git a/docs/2022-07/index.html b/docs/2022-07/index.html
index 777a7b6b6..3f7918603 100644
--- a/docs/2022-07/index.html
+++ b/docs/2022-07/index.html
@@ -19,7 +19,7 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
-
+
@@ -44,9 +44,9 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
   "@type": "BlogPosting",
   "headline": "July, 2022",
   "url": "https://alanorth.github.io/cgspace-notes/2022-07/",
-  "wordCount": "1095",
+  "wordCount": "1660",
   "datePublished": "2022-07-02T14:07:36+03:00",
-  "dateModified": "2022-07-07T10:02:04+03:00",
+  "dateModified": "2022-07-08T15:49:45+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -315,7 +315,123 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
  • For now we will keep them in the search filters
  • I modified my check-duplicates.py script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
  • Deploy latest changes to submission form, Discovery, and browse on CGSpace
  • Fix 152 dcterms.relation that are using “cgspace.cgiar.org” links instead of handles:
    UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
    +
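
The rewrite performed by the `UPDATE` above can be tried out on individual values before touching the database. This is a minimal sketch in Python, assuming the same pattern as the `REGEXP_REPLACE` call; the function name `to_handle_url` and the handle `10568/12345` are hypothetical and only serve as an illustration.

```python
import re

# Same pattern as the REGEXP_REPLACE above: anything ending in
# cgspace.cgiar.org/handle/<prefix>/<suffix> is reduced to the handle itself.
CGSPACE_HANDLE = re.compile(r".*cgspace\.cgiar\.org/handle/(\d+/\d+)$")


def to_handle_url(text_value: str) -> str:
    """Rewrite a CGSpace item link to its hdl.handle.net form.

    Values that do not match the pattern are returned unchanged, which
    mirrors the WHERE clause restricting the UPDATE to matching rows.
    """
    return CGSPACE_HANDLE.sub(r"https://hdl.handle.net/\1", text_value, count=1)


if __name__ == "__main__":
    # Hypothetical example value, not taken from the repository
    print(to_handle_url("https://cgspace.cgiar.org/handle/10568/12345"))
```

Because the pattern is anchored with `$`, links with a trailing slash or query string are left alone, so they would need separate handling.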

    2022-07-10

    + +
    # awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
    +2243
    +

    Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:

    +
    $ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
    +1613
    +$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
    +1696
    +$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
    +1708
    +$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
    +1830
    +$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
    +1811
    +
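
The per-IP `grep` pipelines above could be folded into a single pass over the log. The following is a rough sketch, assuming each relevant log line contains a fragment of the form `session_id=<32 characters>:ip_addr=<IPv4 address>`, as the `grep -oE` pattern suggests; the function name `sessions_per_ip` is hypothetical, and IPv6 clients would need a different address pattern.

```python
import re
from collections import defaultdict

# Assumed fragment, based on the grep pattern in the notes:
#   session_id=<32 uppercase letters or digits>:ip_addr=<IPv4 address>
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def sessions_per_ip(lines):
    """Return a dict mapping each client IP to its number of distinct session IDs."""
    sessions = defaultdict(set)
    for line in lines:
        for session_id, ip in SESSION_RE.findall(line):
            sessions[ip].add(session_id)
    return {ip: len(ids) for ip, ids in sessions.items()}


if __name__ == "__main__":
    # The file name follows the one used in the notes
    with open("dspace.log.2022-07-10", encoding="utf-8", errors="replace") as log:
        counts = sessions_per_ip(log)
    # Print the twenty addresses with the most sessions
    for ip, count in sorted(counts.items(), key=lambda item: item[1], reverse=True)[:20]:
        print(f"{count:7d} {ip}")
```

Unlike `grep 65.109.2.97`, where each unescaped `.` matches any character, this approach compares the captured address exactly.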

    DSpace sessions week
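
A block list such as the five IPv4 networks added to `bot-networks.conf` in these notes can be checked against the addresses seen in the logs. Below is a small sketch using Python's standard `ipaddress` module; the names `BOT_NETWORKS`, `HEAVY_IPS`, and `uncovered` are hypothetical, while the networks and addresses are copied from the notes.

```python
import ipaddress

# Networks listed in the notes for nginx's bot-networks.conf
BOT_NETWORKS = [
    ipaddress.ip_network(network)
    for network in (
        "65.108.0.0/16",
        "65.21.0.0/16",
        "95.216.0.0/16",
        "135.181.0.0/16",
        "138.201.0.0/16",
    )
]

# The five addresses whose session counts are shown in the notes
HEAVY_IPS = [
    "65.109.2.97",
    "95.216.174.97",
    "65.109.15.213",
    "65.108.80.78",
    "65.108.95.23",
]


def uncovered(addresses):
    """Return the addresses that fall outside every listed network."""
    return [
        address
        for address in addresses
        if not any(ipaddress.ip_address(address) in net for net in BOT_NETWORKS)
    ]


if __name__ == "__main__":
    for address in uncovered(HEAVY_IPS):
        print(f"not covered: {address}")
```

With the values as written, the two `65.109.x.x` addresses are not matched by any of the listed /16 networks, so they may need their own entry if they are meant to be tagged as bots.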

    + +
    # awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
    +      1 2a01:4f9:6a:1c2b::2
    +      2 2a01:4f9:2b:5a8::2
    +      2 2a01:4f9:4b:4495::2
    +     96 2a01:4f9:c010:518c::1
    +    137 2a01:4f9:c010:a9bc::1
    +    142 2a01:4f9:c010:58c9::1
    +    142 2a01:4f9:c010:58ea::1
    +    144 2a01:4f9:c010:58eb::1
    +    145 2a01:4f9:c010:6ff8::1
    +    148 2a01:4f9:c010:5190::1
    +    149 2a01:4f9:c010:7d6d::1
    +    153 2a01:4f9:c010:5226::1
    +    156 2a01:4f9:c010:7f74::1
    +    160 2a01:4f9:c010:5188::1
    +    161 2a01:4f9:c010:58e5::1
    +    168 2a01:4f9:c010:58ed::1
    +    170 2a01:4f9:c010:548e::1
    +    170 2a01:4f9:c010:8c97::1
    +    175 2a01:4f9:c010:58c8::1
    +    175 2a01:4f9:c010:aada::1
    +    182 2a01:4f9:c010:58ec::1
    +    182 2a01:4f9:c010:ae8c::1
    +    502 2a01:4f9:c010:ee57::1
    +    530 2a01:4f9:c011:567a::1
    +    535 2a01:4f9:c010:d04e::1
    +    539 2a01:4f9:c010:3d9a::1
    +    586 2a01:4f9:c010:93db::1
    +    593 2a01:4f9:c010:a04a::1
    +    601 2a01:4f9:c011:4166::1
    +    607 2a01:4f9:c010:9881::1
    +    640 2a01:4f9:c010:87fb::1
    +    648 2a01:4f9:c010:e680::1
    +   1141 2a01:4f9:3a:2696::2
    +   1146 2a01:4f9:3a:2555::2
    +   3207 2a01:4f9:3a:2c19::2
    +
    +
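
Grouping the IPv6 hits above by network prefix can make it easier to decide which ranges to block. Here is a minimal sketch with Python's standard `ipaddress` module; the function name `hits_by_prefix` and the default prefix length of 48 are assumptions chosen for illustration, not something the notes prescribe, and `SAMPLE_HITS` holds three rows copied from the output above.

```python
import ipaddress
from collections import Counter


def hits_by_prefix(hits, prefix_len=48):
    """Sum request counts per enclosing network of the given prefix length."""
    totals = Counter()
    for address, count in hits.items():
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        totals[str(network)] += count
    return dict(totals)


# Three rows copied from the awk output in the notes
SAMPLE_HITS = {
    "2a01:4f9:c010:518c::1": 96,
    "2a01:4f9:c010:a9bc::1": 137,
    "2a01:4f9:3a:2c19::2": 3207,
}

if __name__ == "__main__":
    for network, total in sorted(hits_by_prefix(SAMPLE_HITS).items(), key=lambda item: item[1]):
        print(f"{total:7d} {network}")
```

A shorter prefix such as `prefix_len=32` collapses all of the sample rows into `2a01:4f9::/32`, which is the kind of coarse grouping that blocking a whole provider would imply.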

    CPU day

    +
    +
diff --git a/docs/2022/07/cpu-day.png b/docs/2022/07/cpu-day.png
index 09c6ea9f8..e0bce741c 100644
Binary files a/docs/2022/07/cpu-day.png and b/docs/2022/07/cpu-day.png differ
diff --git a/docs/2022/07/jmx_dspace_sessions-week.png b/docs/2022/07/jmx_dspace_sessions-week.png
new file mode 100644
index 000000000..67487cb08
Binary files /dev/null and b/docs/2022/07/jmx_dspace_sessions-week.png differ
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 208fbfacd..102da9fcb 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index f628d8200..96b914e83 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 74c333e60..6f9fabf8a 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index c67852a5f..71d238975 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 2a9178788..3e9d0a22f 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index ced4f95e4..5064831cc 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index cc1054246..9dbccd24e 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index 3fad6c16b..64452dc7f 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 42c9ba3eb..760641b40 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 90c7115d0..5928f6fab 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 534f7e4fd..b18331a12 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 0184cd9ea..7bc731344 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index e3fcbbed1..4bd4263e5 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 260fa9785..336d0b0c2 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 409b709ba..c667fa635 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 3bd709463..da0478629 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index 437accfee..ceef044a0 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 70eaea271..e3c956d14 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index fcf9f844a..dba517d73 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 058713aab..ba9b71c32 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index a71f52d0c..527db0d8a 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 0cc7b80a8..36e8dd8d6 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index e2ecc926d..9a318923c 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 1bad7c985..bb00ed684 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 52ac16c60..5804b6dfb 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 2731176dc..3e0f4771b 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index af42045ea..8d99e1188 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
   xmlns:xhtml="http://www.w3.org/1999/xhtml">
     https://alanorth.github.io/cgspace-notes/categories/
-    2022-07-07T10:02:04+03:00
+    2022-07-08T15:49:45+03:00
     https://alanorth.github.io/cgspace-notes/
-    2022-07-07T10:02:04+03:00
+    2022-07-08T15:49:45+03:00
     https://alanorth.github.io/cgspace-notes/2022-07/
-    2022-07-07T10:02:04+03:00
+    2022-07-08T15:49:45+03:00
     https://alanorth.github.io/cgspace-notes/categories/notes/
-    2022-07-07T10:02:04+03:00
+    2022-07-08T15:49:45+03:00
     https://alanorth.github.io/cgspace-notes/posts/
-    2022-07-07T10:02:04+03:00
+    2022-07-08T15:49:45+03:00
     https://alanorth.github.io/cgspace-notes/2022-06/
     2022-07-04T09:25:14+03:00
diff --git a/static/2022/07/cpu-day.png b/static/2022/07/cpu-day.png
index 09c6ea9f8..e0bce741c 100644
Binary files a/static/2022/07/cpu-day.png and b/static/2022/07/cpu-day.png differ
diff --git a/static/2022/07/jmx_dspace_sessions-week.png b/static/2022/07/jmx_dspace_sessions-week.png
new file mode 100644
index 000000000..67487cb08
Binary files /dev/null and b/static/2022/07/jmx_dspace_sessions-week.png differ