mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-09-30 06:04:16 +02:00
474 lines
24 KiB
Markdown
---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-07-02

- I learned how to use the Levenshtein functions in PostgreSQL
- The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
- Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first

<!--more-->

- A working query checking for duplicates in the recent AfricaRice items is:

```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
text_value
────────────────────────────────────────────────────────────────────────────────────────
International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
```

- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method
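
To make the two caveats above concrete (the 255-character limit and case sensitivity), here is a minimal pure-Python sketch of the same comparison. The names `levenshtein`, `is_probable_duplicate`, and `MAX_LENGTH` are made up for this illustration. The real check runs inside PostgreSQL with `levenshtein_less_equal`, as in the query above.

```python
MAX_LENGTH = 255  # assumption: mirror the PostgreSQL limit mentioned above


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(title: str, candidate: str, max_distance: int = 3) -> bool:
    """Compare lowercased, truncated titles, similar to LOWER() and LEFT() in SQL."""
    left = title.lower()[:MAX_LENGTH]
    right = candidate.lower()[:MAX_LENGTH]
    return levenshtein(left, right) <= max_distance
```

Unlike the SQL query, which only truncates `text_value`, this sketch truncates both strings so the comparison stays symmetric.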

## 2022-07-03

- Start a harvest on AReS

## 2022-07-04

- Linode told me that CGSpace had high load yesterday
- I also got some up and down notices from UptimeRobot
- Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count

![CPU load day](/cgspace-notes/2022/07/cpu-day.png)
![JDBC pool day](/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png)

- Seems we have some old database transactions since 2022-06-27:

![PostgreSQL locks week](/cgspace-notes/2022/07/postgres_locks_ALL-week.png)
![PostgreSQL query length week](/cgspace-notes/2022/07/postgres_querylength_ALL-week.png)

- Looking at the top connections to nginx yesterday:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
1132 64.124.8.34
1146 2a01:4f8:1c17:5550::1
1380 137.184.159.211
1533 64.124.8.59
4013 80.248.237.167
4776 54.195.118.125
10482 45.5.186.2
11177 172.104.229.92
15855 2a01:7e00::f03c:91ff:fe9a:3a37
22179 64.39.98.251
```

- And the total number of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
```
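
For reference, the two shell pipelines above (top talkers and unique client count) could be sketched in Python roughly as follows. This assumes, like the `awk '{print $1}'` calls, that the client address is the first whitespace-separated field of each log line. The function names are invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def client_ips(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-separated field of each non-empty line."""
    return [line.split()[0] for line in lines if line.strip()]


def top_clients(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequent client addresses, busiest first."""
    return Counter(client_ips(lines)).most_common(n)


def unique_clients(lines: Iterable[str]) -> int:
    """Number of distinct client addresses, like `sort -u | wc -l`."""
    return len(set(client_ips(lines)))
```

Note that `most_common` returns the busiest client first, whereas `sort -h | tail` lists it last.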

- This seems low, so the load must have come from the request patterns of certain visitors
- 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
- The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
- 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
- I implemented a geo mapping for the user agent mapping AND the nginx `limit_req_zone` by extracting the networks into an external file and including it in two different geo mapping blocks
- This is clever and relies on the fact that we can use defaults in both cases
- First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
- Second, we use this as a key in a `limit_req_zone`, which relies on a default mapping of '' (nginx does not account requests whose key is empty)
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other so I changed them to "ka"
- Perhaps we need to update our list of languages to include all instead of the most common ones
- I wrote a script `ilri/iso-639-value-pairs.py` to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`

## 2022-07-06

- CGSpace went down and up a few times due to high load
- I found one host in Romania making very high speed requests with a normal user agent (`Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C`):

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort | uniq -c | sort -h | tail -n 10
516 142.132.248.90
525 157.55.39.234
587 66.249.66.21
593 95.108.213.59
1372 137.184.159.211
4776 54.195.118.125
5441 205.186.128.185
6267 45.5.186.2
15839 2a01:7e00::f03c:91ff:fe9a:3a37
36114 146.19.75.141
```

- I added 146.19.75.141 to the list of bot networks in nginx
- While looking at the logs I started thinking about Bing again
- They apparently [publish a list of all their networks](https://www.bing.com/toolbox/bingbot.json)
- I wrote a script to use `prips` to [print the IPs for each network](https://stackoverflow.com/a/52501093/1996540)
- The script is `bing-networks-to-ips.sh`
- From Bing's IPs alone I purged 145,403 hits... sheesh
- Delete two items on CGSpace for Margarita because she was getting the "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-..." error
- This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
- Update some `cg.audience` metadata to use "Academics" instead of "Academicians":

```console
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104
```

- I will also have to remove "Academicians" from input-forms.xml

## 2022-07-07

- Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
- I used the [SQL helper functions](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the collections where each term was used:

```console
localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
collection │ count
─────────────┼───────
10568/36178 │ 56
10568/36185 │ 46
10568/36181 │ 35
10568/36188 │ 28
10568/36179 │ 21
(5 rows)
```

- For now I only did terms from my list that had 100 or more occurrences in CGSpace
- This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC
- Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
- We want to remove them from the submission form to create space for new fields
- Update one term I noticed people using that was close to AGROVOC:

```console
dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108
```

- After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
  - Bioversity subject (`cg.subject.bioversity`)
  - CCAFS phase 1 project tag (`cg.identifier.ccafsproject`)
  - CIAT project tag (`cg.identifier.ciatproject`)
  - CIAT subject (`cg.subject.ciat`)
- Work on cleaning and proofing forty-six AfricaRice items for CGSpace
- Last week we identified some duplicates so I removed those
- The data is of mediocre quality
- I've been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
- I even found titles that have typos, looking something like OCR errors...

## 2022-07-08

- Finalize the cleaning and proofing of AfricaRice records
- I found two suspicious items that claim to have been published but I can't find in the respective journals, so I removed those
- I uploaded the forty-four items to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/119135)
- Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
- I removed these from `input-forms.xml` and the Discovery facets:
  - cg.identifier.ccafsprojectpii
  - cg.subject.cifor
- For now we will keep them in the search filters
- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
- I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
- I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually
- Deploy latest changes to submission form, Discovery, and browse on CGSpace
- Also run all system updates and reboot the host
- Fix 152 `dcterms.relation` that are using "cgspace.cgiar.org" links instead of handles:

```console
UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
```
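
To see what the regular expression in the `UPDATE` above does to a single value, here is a small Python sketch using the same pattern. The function name `to_handle_url` and the handle `10568/12345` are made up for the example.

```python
import re

# Same pattern as in the SQL above: anything up to the CGSpace host, then a
# handle (prefix/suffix) at the very end of the string
CGSPACE_HANDLE = re.compile(r".*cgspace\.cgiar\.org/handle/(\d+/\d+)$")


def to_handle_url(value: str) -> str:
    """Rewrite a CGSpace item link to its hdl.handle.net form; leave others alone."""
    return CGSPACE_HANDLE.sub(r"https://hdl.handle.net/\1", value)


print(to_handle_url("https://cgspace.cgiar.org/handle/10568/12345"))
```

Because the pattern is anchored with `$`, links with anything after the handle are left untouched. The SQL achieves the same through its `WHERE ... ~` condition.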

## 2022-07-10

- UptimeRobot says that CGSpace is down
- I see high load around 22, high CPU around 800%
- Doesn't seem to be a lot of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2243
```

- Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:

```console
$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1613
$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1696
$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1708
$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1830
$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1811
```
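
Rather than repeating the `grep` pipeline once per address, the same count could be computed with a small Python helper. This is only a sketch. It reuses the `session_id=...:ip_addr=` pattern from the commands above, and the function name `unique_sessions` is invented here.

```python
import re
from typing import Iterable

# Same pattern as the grep -oE expression above, capturing the session ID
SESSION = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=")


def unique_sessions(lines: Iterable[str], ip: str) -> int:
    """Count distinct session IDs on log lines that mention the given address."""
    sessions = set()
    for line in lines:
        if ip in line:  # literal substring match, like a simple grep
            sessions.update(SESSION.findall(line))
    return len(sessions)
```

As with the `grep` version, the address match is a plain substring test, so `65.108.80.7` would also match lines for `65.108.80.78`. It is fine for a quick look but not for exact accounting.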

![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week.png)

- These IPs are using normal-looking user agents:
  - `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1`
  - `Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0"`
  - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1`
  - `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36`

- I will add networks I'm seeing now to nginx's bot-networks.conf for now (not all of Hetzner) and purge the hits later:
  - 65.108.0.0/16
  - 65.21.0.0/16
  - 95.216.0.0/16
  - 135.181.0.0/16
  - 138.201.0.0/16

- I think I'm going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
- Sheesh, there are a bunch more IPv6 addresses also on Hetzner:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
1 2a01:4f9:6a:1c2b::2
2 2a01:4f9:2b:5a8::2
2 2a01:4f9:4b:4495::2
96 2a01:4f9:c010:518c::1
137 2a01:4f9:c010:a9bc::1
142 2a01:4f9:c010:58c9::1
142 2a01:4f9:c010:58ea::1
144 2a01:4f9:c010:58eb::1
145 2a01:4f9:c010:6ff8::1
148 2a01:4f9:c010:5190::1
149 2a01:4f9:c010:7d6d::1
153 2a01:4f9:c010:5226::1
156 2a01:4f9:c010:7f74::1
160 2a01:4f9:c010:5188::1
161 2a01:4f9:c010:58e5::1
168 2a01:4f9:c010:58ed::1
170 2a01:4f9:c010:548e::1
170 2a01:4f9:c010:8c97::1
175 2a01:4f9:c010:58c8::1
175 2a01:4f9:c010:aada::1
182 2a01:4f9:c010:58ec::1
182 2a01:4f9:c010:ae8c::1
502 2a01:4f9:c010:ee57::1
530 2a01:4f9:c011:567a::1
535 2a01:4f9:c010:d04e::1
539 2a01:4f9:c010:3d9a::1
586 2a01:4f9:c010:93db::1
593 2a01:4f9:c010:a04a::1
601 2a01:4f9:c011:4166::1
607 2a01:4f9:c010:9881::1
640 2a01:4f9:c010:87fb::1
648 2a01:4f9:c010:e680::1
1141 2a01:4f9:3a:2696::2
1146 2a01:4f9:3a:2555::2
3207 2a01:4f9:3a:2c19::2
```

- Maybe it's time I ban all of Hetzner... sheesh.
- I left for a few hours and the server was going up and down the whole time, still with very high CPU and database load when I got back

![CPU day](/cgspace-notes/2022/07/cpu-day.png)

- I am not sure what's going on
- I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them and extract all the Hetzner networks and block them
- It's 181 IPs on Hetzner...
- I rebooted the server to see if it was just some stuck locks in PostgreSQL...
- The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? Two more subnets to block
- Start a harvest on AReS

## 2022-07-12

- Update an incorrect ORCID identifier for Alliance
- Adjust collection permissions on CIFOR publications collection so Vika can submit without approval

## 2022-07-14

- Someone on the DSpace Slack mentioned having issues with the database configuration in DSpace 7.3
- The reason is apparently that the default `db.dialect` changed from "org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect" to "org.hibernate.dialect.PostgreSQL94Dialect" as a result of a Hibernate update
- Then I was getting more errors starting the backend server in Tomcat, but the issue was that the backend server needs Solr to be up first!

## 2022-07-17

- Start a harvest on AReS around 3:30PM
- Later in the evening I see CGSpace was going down and up (not as bad as last Sunday) with around 18.0 load...
- I see very high CPU usage:

![CPU day](/cgspace-notes/2022/07/cpu-day2.png)

- But DSpace sessions are normal (not like last weekend):

![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week2.png)

- I see some Hetzner IPs in the top users today, but most of the requests are getting HTTP 503 because of the changes I made last week
- I see 137.184.159.211, which is on Digital Ocean, and the DNS is apparently iitawpsite.iita.org
- I've seen their user agent before, but I don't think I knew it was IITA: "GuzzleHttp/6.3.3 curl/7.84.0 PHP/7.4.30"
- I already have something in nginx to mark Guzzle as a bot, but interestingly it shows up in Solr as `$http_user_agent` so there is a logic error in my nginx config
- Ouch, the logic error seems to be this:

```console
geo $ua {
    default $http_user_agent;

    include /etc/nginx/bot-networks.conf;
}
```

- After some testing on DSpace Test I see that this is actually setting the default user agent to a literal `$http_user_agent`
- The [nginx map docs](http://nginx.org/en/docs/http/ngx_http_map_module.html) say:

> The resulting value can contain text, variable (0.9.0), and their combination (1.11.0).

- But I can't get it to work, neither for the default value nor for matching my IP...
- I will have to ask on the nginx mailing list
- The total number of requests and unique hosts was not even very high (the numbers below are from around midnight, so they cover almost the whole day):

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2776
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | wc -l
40325
```

## 2022-07-18

- Reading more about nginx's geo/map and doing some tests on DSpace Test, it appears that the [geo module cannot do dynamic values](https://stackoverflow.com/questions/47011497/nginx-geo-module-wont-use-variables)
- So this issue with the literal `$http_user_agent` is due to the geo block I put in place earlier this month
- I reworked the logic so that the geo block sets "bot" or an empty string when a network matches or not, and then re-use that value in a mapping that passes through the host's user agent in case geo has set it to an empty string
- This allows me to accomplish the original goal while still only using one bot-networks.conf file for the `limit_req_zone` and the user agent mapping that we pass to Tomcat
- Unfortunately this means I will have hundreds of thousands of requests in Solr with a literal `$http_user_agent`
- I might try to purge some by enumerating all the networks in my block file and running them through `check-spider-ip-hits.sh`
- I extracted all the IPs/subnets from `bot-networks.conf` and prepared them so I could enumerate their IPs
- I had to add `/32` to all single IPs, which I did with this crazy vim invocation:

```console
:g!/\/\d\+$/s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/
```

- Explanation:
  - `g!`: global, lines *not* matching (the opposite of `g`)
  - `/\/\d\+$/`, pattern matching `/` with one or more digits at the end of the line
  - `s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/`, for lines not matching above, capture the IPv4 address and add `/32` at the end
- Then I ran the list through prips to enumerate the IPs:

```console
$ while read -r line; do prips "$line" | sed -e '1d; $d'; done < /tmp/bot-networks.conf > /tmp/bot-ips.txt
$ wc -l /tmp/bot-ips.txt
1946968 /tmp/bot-ips.txt
```
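
The two preparation steps above (appending `/32` to bare addresses, then enumerating each network) could also be done with Python's standard `ipaddress` module. This is a rough sketch for IPv4 entries only, since enumerating IPv6 networks this way would be impractical. The function names are invented for the example.

```python
import ipaddress
import re
from typing import Iterable, Iterator

BARE_IPV4 = re.compile(r"^(\d+\.\d+\.\d+\.\d+)$")


def with_prefix(entry: str) -> str:
    """Append /32 to a bare IPv4 address; leave entries that already have a prefix."""
    return BARE_IPV4.sub(r"\1/32", entry.strip())


def enumerate_ips(entries: Iterable[str]) -> Iterator[str]:
    """Yield every usable address in each network, one network at a time."""
    for entry in entries:
        network = ipaddress.ip_network(with_prefix(entry), strict=False)
        addresses = [str(address) for address in network]
        if len(addresses) > 2:
            # Drop the network and broadcast addresses, like `sed -e '1d; $d'`
            addresses = addresses[1:-1]
        yield from addresses
```

One difference worth noting: this sketch keeps the single address of a `/32` (and both addresses of a `/31`), whereas `sed -e '1d; $d'` would, as far as I can tell, delete the only line that prips prints for a `/32`.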

- I started running `check-spider-ip-hits.sh` with the 1946968 IPs and left it running in dry run mode
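
For clarity, the reworked two-step logic from the start of this entry (geo sets a flag, then a mapping falls back to the real user agent) can be modelled in Python. This is only a model of the decision logic under my reading of the notes, not nginx configuration. The function names are hypothetical, and the example networks are ones mentioned earlier this month.

```python
import ipaddress
from typing import Iterable


def geo_bot_flag(client_ip: str, bot_networks: Iterable[str]) -> str:
    """Step 1 (the geo block): "bot" if the address is in a listed network, else ""."""
    address = ipaddress.ip_address(client_ip)
    for entry in bot_networks:
        network = ipaddress.ip_network(entry, strict=False)
        if address.version == network.version and address in network:
            return "bot"
    return ""


def effective_user_agent(bot_flag: str, http_user_agent: str) -> str:
    """Step 2 (the map): pass the real user agent through unless geo flagged a bot."""
    return bot_flag or http_user_agent


networks = ["65.108.0.0/16", "146.19.75.141"]
flag = geo_bot_flag("65.108.80.78", networks)
print(repr(flag), effective_user_agent(flag, "Mozilla/5.0"))
```

The same flag doubles as the `limit_req_zone` key. Requests with an empty key are not counted against the limit, so only the flagged networks end up rate limited.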

## 2022-07-19

- Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
- It's one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API
- Add ORCID identifiers for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace and tag them and some other existing items:

```console
dc.contributor.author,cg.creator.identifier
"Dhulipala, Ram K","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, Ram","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, R.","Ram Dhulipala: 0000-0002-9720-3247"
"Wambua, Lillian","Lillian Wambua: 0000-0003-3632-7411"
"Wambua, Lilian","Lillian Wambua: 0000-0003-3632-7411"
"Masiga, D.K.","Daniel Masiga: 0000-0001-7513-0887"
"Masiga, Daniel K.","Daniel Masiga: 0000-0001-7513-0887"
"Jores, Joerg","Joerg Jores: 0000-0003-3790-5746"
"Schieck, Elise","Elise Schieck: 0000-0003-1756-6337"
"Schieck, Elise G.","Elise Schieck: 0000-0003-1756-6337"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```

- Review the AfricaRice records from earlier this month again
- I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two
- I took all the ~560 IPs that had hits so far in `check-spider-ip-hits.sh` above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
- This purged 199,032 hits from Solr, very many of which were from Qualys, but also from that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago, which I blocked in nginx but never purged the hits from
- Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script

## 2022-07-20

- Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
- Once it worked well I did an import to CGSpace:

```console
$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
```

- Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up
- Extract another ~1,600 IPs that had hits since I started the second round of `check-spider-ip-hits.sh` yesterday and purge another 303,594 hits
- This is about 999846 into the original list of 1946968 from yesterday
- A metric fuck ton of the IPs in this batch were from Hetzner

## 2022-07-21

- Extract another ~2,100 IPs that had hits since I started the third round of `check-spider-ip-hits.sh` last night and purge another 763,843 hits
- This is about 1441221 into the original list of 1946968 from two days ago
- Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)
- I responded to my original request to Atmire about the log4j to reload4j migration in DSpace 6.4
- I had initially requested a comment from them in 2022-05
- Extract another ~1,200 IPs that had hits from the fourth round of `check-spider-ip-hits.sh` earlier today and purge another 74,591 hits
- Now the list of IPs I enumerated from the nginx `bot-networks.conf` is finished

## 2022-07-22

- I created a new Linode instance for testing DSpace 7
- Jose from the CCAFS team sent me the final versions of 3,500+ Innovations, Policies, MELIAs, and OICRs from MARLO
- I re-synced CGSpace with DSpace Test so I can have a newer snapshot of the production data there for testing the CCAFS MELIAs, OICRs, Policies, and Innovations
- I re-created the tip-submit and tip-approve DSpace user accounts for Alliance's new TIP submit tool and added them to the Alliance submitters and Alliance admins accounts respectively
- Start working on updating the Ansible infrastructure playbooks for DSpace 7 stuff

## 2022-07-23

- Start a harvest on AReS
- More work on DSpace 7 related issues in the [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)

## 2022-07-24

- More work on DSpace 7 related issues in the [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)

## 2022-07-25

- More work on DSpace 7 related issues in the [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
- I see that, for Solr, we will need to copy the DSpace configsets to the writable data directory rather than the default home dir
- The [Taking Solr to production guide](https://solr.apache.org/guide/8_11/taking-solr-to-production.html) recommends keeping the unzipped code separate from the data, which we do in our Solr role already
- So that means we keep the unzipped code in `/opt/solr-8.11.2`, but the data directory in `/var/solr/data`, with the DSpace Solr cores here `/var/solr/data/configsets`
- I'm not sure how to integrate that into my playbooks yet
- Much to my surprise, Discovery indexing on DSpace 7 was really fast when I did it just now, apparently taking 40 minutes of wall clock time?!:

```console
$ /usr/bin/time -v /home/dspace7/bin/dspace index-discovery -b
The script has started
(Re)building index from scratch.
Done with indexing
The script has completed
Command being timed: "/home/dspace7/bin/dspace index-discovery -b"
User time (seconds): 588.18
System time (seconds): 91.26
Percent of CPU this job got: 28%
Elapsed (wall clock) time (h:mm:ss or m:ss): 40:05.79
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 635380
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1513
Minor (reclaiming a frame) page faults: 216412
Voluntary context switches: 1671092
Involuntary context switches: 744007
Swaps: 0
File system inputs: 4396880
File system outputs: 74312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
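
As a quick sanity check on the numbers above, the "Percent of CPU" figure is simply user plus system time divided by elapsed wall clock time. A small sketch of that arithmetic follows; the helper names are made up.

```python
def parse_elapsed(value: str) -> float:
    """Convert an "h:mm:ss" or "m:ss.ss" elapsed time to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(user: float, system: float, elapsed: str) -> float:
    """CPU time as a percentage of wall clock time."""
    return 100 * (user + system) / parse_elapsed(elapsed)


# Values copied from the time output above
print(int(cpu_percent(588.18, 91.26, "40:05.79")))
```

So only about eleven of the forty minutes were CPU time in this process, which would be consistent with the indexer mostly waiting on the database, Solr, or disk, though the output alone does not say which.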

- Leroy from the Alliance wrote to say that the CIAT Library is back up so I might be able to download all the PDFs
- It had been shut down for a security reason a few months ago and we were planning to download them all and attach them to their relevant items on CGSpace
- I noticed one item that had the PDF already on CGSpace so I'll need to consider that when I eventually do the import
- I had to re-create the tip-submit and tip-approve accounts for Alliance on DSpace Test again
- After I created them last week they somehow got deleted...?!... I couldn't find them or the mel-submit account either!

<!-- vim: set sw=2 ts=2: -->