Compare commits


104 Commits

Author SHA1 Message Date
63a2dcfdee Add notes for 2025-01-03 2025-01-03 12:37:39 +03:00
e7d7d4af89 Add notes 2024-12-04 16:27:49 +03:00
bd2d9779bb Add notes 2024-11-19 10:40:23 +03:00
47b96e8370 Add notes for 2024-10-08 2024-10-08 13:46:23 +03:00
512848fc73 Add notes for 2024-10-03 2024-10-03 11:51:44 +03:00
f8a1876ad2 Add notes for 2024-09-29 2024-09-30 07:56:53 +03:00
bb1367025a Add notes for 2024-09-23 2024-09-23 13:10:20 +03:00
dabbc20806 Update notes for 2024-09-16 2024-09-17 08:11:03 +04:00
edd2a8b306 Add docs again 2024-09-17 08:02:34 +04:00
842373d26f Update themes/hugo-theme-bootstrap4-blog 2024-09-17 08:01:55 +04:00
35342f95dc Add notes for 2024-09-16 2024-09-16 22:52:51 +04:00
79708bd30c Add notes for 2024-09-14 2024-09-14 23:02:16 +03:00
a5298945a3 Add notes for 2024-09 2024-09-09 10:20:09 +03:00
062019463c Add docs 2024-08-28 11:35:14 +03:00
f1c25111d0 Add notes 2024-08-28 11:35:05 +03:00
da6d73bc1f content/post/2024-07.md: fix spaces 2024-08-22 09:51:08 +03:00
7be53639dc Add content/posts/2024-08.md 2024-08-16 19:57:30 -07:00
64b8957945 Update notes 2024-08-07 08:54:13 -07:00
89d1b61442 Update notes for 2024-07-11 2024-07-11 13:08:22 +03:00
668947909a Add notes 2024-07-02 11:12:03 +03:00
7858008918 Add notes for 2024-06-21 2024-06-23 09:34:49 +03:00
c3436ea6c2 Add notes for 2024-06-18 2024-06-18 17:30:08 +03:00
bf4a6402d7 Add notes 2024-06-16 16:40:54 +03:00
8383cd466b Add notes for 2024-06-03 2024-06-03 17:31:03 +03:00
6d574d645d Add notes for 2024-05-28 2024-05-28 16:40:32 +03:00
befe3a3a58 Add notes for 2024-05-27 2024-05-27 21:40:09 +03:00
39d8d0876c Add notes for 2024-05-20 2024-05-20 17:34:14 +03:00
28a0c82e96 Minor syntax fix in example 2024-05-16 08:27:56 +03:00
7fc97884df Add notes for 2024-05-13 2024-05-13 16:24:11 +03:00
223453adbb Add notes for 2023-05-13 2024-05-13 08:21:17 +03:00
1b523bf055 Add notes for 2024-05-05 2024-05-05 21:43:52 +03:00
908a75a5c7 Add notes for 2024-05-01 2024-05-01 17:10:05 +03:00
e323c15e8b Add notes for 2024-04-29 2024-04-29 17:21:28 +03:00
8f156a0365 Add notes 2024-04-27 11:22:58 +03:00
515cc0650f Add notes 2024-04-25 15:28:35 +03:00
6db3da2739 Add notes 2024-04-18 17:00:25 +03:00
60b244486f Add notes 2024-04-18 09:38:02 +03:00
efd8eb7f79 Add notes 2024-04-16 09:35:30 +03:00
281827944a Add notes for 2024-04-12 2024-04-12 20:40:52 +03:00
864b3b136e Add notes 2024-04-09 16:50:56 +03:00
01a2ff5bfd Add notes 2024-04-04 10:23:49 +03:00
d71c430a7d Add notes 2024-03-25 18:53:18 +03:00
0e43fc97d7 Add notes for 2024-03-19 2024-03-19 16:24:20 +03:00
90c4d46607 Add notes 2024-03-19 09:01:13 +03:00
83c053f7ee Add notes for 2024-03-13 2024-03-14 09:29:05 +03:00
ba68787282 Update notes for 2024-03-11 2024-03-11 21:58:15 +03:00
1fc45e8f1b Add notes for 2024-03-11 2024-03-11 18:04:40 +03:00
11f1935f85 Add notes for 2024-03-08 2024-03-08 17:31:19 +03:00
5ff70af33b Add notes for 2024-03 2024-03-04 10:02:14 +03:00
b60a58f56a Fix date for 2024-02 frontmatter 2024-03-01 09:55:02 +03:00
cc28c0ccdc Add notes for 2024-02-29 2024-02-29 16:38:38 +03:00
1e87242956 Add notes for 2024-02-29 2024-02-29 09:41:44 +03:00
483a170f06 Add notes 2024-02-27 17:18:35 +03:00
0692b8666c Add notes for 2024-02-23 2024-02-24 20:44:15 +03:00
b2eaff29b1 Add notes for 2024-02-20 2024-02-20 22:55:09 +03:00
da0fd61b7e Add notes for 2024-02-19 2024-02-19 16:48:20 +03:00
3f4b66bd08 Add notes for 2024-02 2024-02-06 11:45:02 +03:00
ed290fb6f8 Add notes for 2024-01-29 2024-02-05 11:09:40 +03:00
63c20dbef9 Add notes for 2024-01-27 2024-01-28 09:23:40 +03:00
300b2e4271 Notes for 2024-01-23 2024-01-24 08:24:50 +03:00
57fe0587a4 Add notes 2024-01-18 15:59:49 +03:00
20ace46614 Add notes 2024-01-10 17:21:12 +03:00
3475d4fd5d Add notes for 2024-01-10 2024-01-10 08:34:16 +03:00
1dfb54ef6b Update notes for 2024-01-07 2024-01-07 22:18:43 +03:00
82c79fc257 Add notes for 2024-01-07 2024-01-07 20:43:02 +03:00
cf5c1e2155 Add notes for 2024-01-06 2024-01-06 17:46:07 +03:00
7418dae4b9 Add notes 2024-01-05 15:45:46 +03:00
264cdcf1db Add notes 2023-12-29 12:08:57 +03:00
293b500b26 content/posts/2023-07.md: minor grammar fix 2023-12-27 10:48:32 +03:00
17a241de5b Add notes for 2023-12-20 2023-12-21 10:09:15 +03:00
7695eacf7a Add notes 2023-12-18 23:15:27 +03:00
f4c985c16b Add notes for 2023-12-12 2023-12-12 14:57:07 +03:00
bc6412de09 Add notes for 2023-12-08 2023-12-09 09:55:16 +03:00
2ecafafc17 Notes for 2023-12-08 2023-12-08 16:32:48 +03:00
804a505ae2 docs: regenerate 2023-12-06 20:57:19 +03:00
6c5fa7375f Fix notes for 2023-11 2023-12-06 20:57:07 +03:00
f2bee38014 Add notes for 2023-12-05 2023-12-06 09:55:57 +03:00
a50fe66c78 Add notes 2023-12-02 10:38:09 +03:00
177c3b796d Add notes for 2023-11-23 2023-11-23 16:15:13 +03:00
eb218389a0 Add notes for 2023-11-18 2023-11-19 14:29:52 +03:00
1dd5900fbf Add notes for 2023-11-16 2023-11-16 17:25:15 +03:00
d14dd7114a Add notes for 2023-11-11 2023-11-13 16:54:36 +03:00
01fb17950b Add notes 2023-11-08 08:20:31 +03:00
c6d514bef9 Add notes for 2023-11-02 2023-11-02 20:58:43 +03:00
34523acc47 Add notes for 2023-10-27 2023-10-27 17:09:30 +03:00
3a4ecbd82d Add notes 2023-10-24 23:26:01 +03:00
c9bcfca903 Add notes for 2023-10-16 2023-10-16 17:03:59 +03:00
7e3a7951d6 Add notes for 2023-10-13 2023-10-13 17:17:41 +03:00
8d39fc7d71 Fix typo 2023-10-08 22:04:41 +03:00
22dd379e9a Add notes for 2023-10-07 2023-10-08 10:57:53 +03:00
98cdd21cb5 Add notes for 2023-10-06 2023-10-06 15:19:34 +03:00
62838a091c Add notes for 2023-10-05 2023-10-05 17:58:03 +03:00
cb40610726 Update notes 2023-10-04 09:24:33 +03:00
249d9be387 Update notes 2023-09-30 13:07:23 +03:00
4a02a78186 Add notes for 2023-09-25 2023-09-25 17:38:05 +03:00
aa6cbb488d Add notes for 2023-09-22 2023-09-23 10:15:01 +03:00
aeaa397612 Add notes for 2023-09-19 2023-09-19 21:13:52 +03:00
d60b85433d Update notes for 2023-09-16 2023-09-16 23:38:04 +03:00
202d3fb88f Add notes for 2023-09-16 2023-09-16 20:24:24 +03:00
afcbc67874 Add notes for 2023-09-13 2023-09-14 20:57:25 +03:00
22e47beeb6 Add notes for 2023-09-10 2023-09-11 09:18:52 +03:00
223979f267 Add notes for 2023-09-09 2023-09-10 09:58:29 +03:00
28d62f1c0c Update notes for 2023-09-08 2023-09-09 00:25:48 +03:00
34bf124d5d Add notes for 2023-09-08 2023-09-09 00:25:12 +03:00
193 changed files with 16872 additions and 9701 deletions

View File

@@ -169,7 +169,7 @@ $ csvjoin --outer -c alpha2 ~/Downloads/clarisa-countries.csv ~/Downloads/UNSD\
- Then re-export the UN M.49 countries to a clean list because the one I did yesterday somehow has errors:
```console
csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ \ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
$ csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ \ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
```
- Check the number of lines in each file:

View File

@@ -63,7 +63,7 @@ $ csvjoin -c doi /tmp/2023-02-01-cgspace-doi-metadata.csv ~/Downloads/2023-02-01
```console
curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
$ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
```
- I need to ask on the DSpace Slack about this POST pagination

View File

@@ -53,7 +53,7 @@ categories: ["Notes"]
- In the past I've found their _licensing_ information to not be very reliable (preferring Crossref), but I think their _open access status_ is more reliable, especially when the provider is listed as being the publisher
- Even so, sometimes the version can be "acceptedVersion", which is presumably the author's version, as opposed to the "publishedVersion", which means it's available as open access on the publisher's website
- I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses
- Delete duplicate metadata as describe in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Start working on some statistics on AGROVOC usage for my presentation next week
- I used the following SQL query to dump values from all subject fields and lower case them:

View File

@@ -237,7 +237,7 @@ https://dspace7test.ilri.org/server/api/discover/search/objects?query=lastModifi
- Oh nice, and we can do the same for accession date:
```
https://dspace7test.ilri.org/server/api/discover/search/objects?query=dc.date.accessioned_dt%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D'
https://dspace7test.ilri.org/server/api/discover/search/objects?query=dc.date.accessioned_dt%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
```
- That is this query: `dc.date.accessioned_dt:[2023-08-01T00:00:00Z TO *]`

View File

@@ -18,4 +18,226 @@ categories: ["Notes"]
- It still feels hacky, but using [AfterViewInit](https://stackoverflow.com/questions/41936631/how-to-trigger-the-function-after-dom-markup-is-loaded-in-angular-style-applicat), and importing the Altmetric `embed.js` in the component works
- The style on mobile also needs work...
## 2023-09-06
- Discussion with Marie about finalizing the output types list on GitHub
- I did some review and cleanup in preparation for publishing the new list
## 2023-09-07
- Export CGSpace to start doing a review of the metadata
- First I will start by extracting all items with DOIs, along with some fields I can compare against Crossref:
```console
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv \
| csvcut -c 'id,dc.title[en_US],dcterms.issued[en_US],dcterms.available[en_US],cg.issn[en_US],cg.isbn[en_US],cg.volume[en_US],cg.issue[en_US],cg.number[en_US],dcterms.extent[en_US],cg.identifier.doi[en_US],cg.reviewStatus[en_US],cg.isijournal[en_US],dcterms.license[en_US],dcterms.accessRights[en_US],dcterms.type[en_US],dc.identifier.uri[en_US]' \
> /tmp/2023-09-07-cgspace-dois.csv
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv | csvcut -c 'cg.identifier.doi[en_US]' | sed 1d > /tmp/2023-09-07-cgspace-dois.txt
```
- Then I resolved the DOIs from Crossref:
```console
$ ./ilri/crossref_doi_lookup.py -i /tmp/2023-09-07-cgspace-dois.txt -o /tmp/2023-09-07-cgspace-dois-results.csv -e a.orth@cgiar.org
```
- A user emailed to ask about uploading a 180MB PDF to CGSpace
- I used GhostScript to try reducing it using the `screen`, `ebook` and `prepress` presets:
```console
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-screen.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-ebook.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-prepress.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
```
- The `prepress` one is 300DPI and looks visually identical to the original, so I proposed that we use that one
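- To compare the presets, something like this works to check the resulting file sizes and embedded image resolutions (a sketch using `du` and poppler's `pdfimages -list`, which prints the x/y DPI of each image):
```console
$ du -h primer-screen.pdf primer-ebook.pdf primer-prepress.pdf
$ pdfimages -list primer-prepress.pdf | head
```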
## 2023-09-08
- I did a review of the metadata for our items with DOIs, comparing with data from Crossref
- I spot checked a handful of issue / online dates and licenses, and saw that Crossref's dates are always more accurate than ours when they differ
- I also filled in some missing volumes, issues, ISSNs, and extents
- This results in 14,000 changes to existing items, which will take several days to import unfortunately
- After eight hours the first file is only about 2/3 finished... sigh
- Meet with Peter to discuss changes to the DSpace 7 test
- Minor updates to submission forms and some new ideas for the home page and item page
- I figured out how to use a themed home page component and add a cards UI to our CGSpace theme
## 2023-09-09
- I can't believe that almost 18 hours later the first CSV import with 5,000 changes is not done...
- Run all system updates on CGSpace and reboot it, as it had been two months since the last time
## 2023-09-10
- Minor work on the DSpace 7 home page
## 2023-09-11
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-09-12
- Minor work on DSpace 7 home page
- Minor work on CG Core types
- I published a new HTML version of the updated IPtypes and archived the current version as v2.0.0 so we can still reference it
## 2023-09-13
- Stefano reminded me about the updated OAI MODS mappings on CGSpace so I re-applied them on DSpace Test and updated the OAI index so he could confirm
- Now I'm ready to put it on CGSpace if he confirms
- I created a basic theme for CIP on DSpace 7
- While doing that I noticed that a bunch of CIP bitstreams didn't have the latest 500px thumbnails so I re-ran filter-media on a handful of their collections
- I had two occurrences of an OOM kill of the Tomcat 9 java process on DSpace 7 test tonight
- Once while doing a Discovery index, the other while doing filter media
## 2023-09-15
- Discuss issues with the Altmetric API with the Altmetric support team
- Apparently we can use a different API, the [Explorer API](https://www.altmetric.com/explorer/documentation/api), since we already have access to the Explorer dashboard
- I reduced the Solr heap size on DSpace 7 from 3GB to 2GB
- Apparently I already did this from 4GB to 3GB a few months ago
- The Solr admin interface was showing Solr taking ~1GB of RAM so I think this should be safe
- Mark on DSpace Slack said he uses PM2's `--max-memory-restart` so the processes restart when they hit the limit
- Also, he said he had to reduce `cache:serverSide:botCache:max` from 1000 to 500 to cache fewer SSR pages in memory
- I decided to try deploying DSpace 7 Test on a Hetzner server with 64GB RAM, 6 CPUs, and 2x512GB NVMe SSD
## 2023-09-16
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- Configure the privacy policy page on DSpace 7 using a themed component with the text from our DSpace 6 site
- I realized that for all my custom Angular components I should be using `routerLink` instead of `href` when I am constructing links
- The `routerLink` routes within the single page application and saves state, while the `href` reloads the page
- Using the `routerLink` way is faster and results in less flashing and jumping in the page when navigating
- See: https://stackoverflow.com/a/61588147
## 2023-09-17
- I added an About page to DSpace 7 Test using similar logic to the privacy page
## 2023-09-18
- I filed a GitHub issue for being unable to navigate dropdown lists using the keyboard on the dspace-angular submission form: https://github.com/DSpace/dspace-angular/issues/2500
- I filed a GitHub issue for the search filters capitalizing metadata values: https://github.com/DSpace/dspace-angular/issues/2501
## 2023-09-19
- Complete migration of DSpace 7 Test from Linode to Hetzner
- Export some years of Solr stats from CGSpace to import on the new DSpace 7 Test:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020-2022.json -f 'time:[2020-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
- Ben sent me an export of ILRI presentations from Slideshare and asked if we could see if any are missing on CGSpace
- First I exported CGSpace and extracted the `cg.identifier.url` column so I could normalize all Slideshare URLs to use "https://www.slideshare.net" instead of localized variants (es.slideshare.net, fr.slideshare.net, etc) as well as non-https links and links with query params and slashes at the end
- This was about 250 URLs
- I extracted the URL field from both our list and the Slideshare list and then used [GNU `join` to print non-matched lines](https://unix.stackexchange.com/questions/274548/join-two-files-each-with-two-columns-including-non-matching-lines):
```console
$ join -t, -v 2 -11 -21 -o auto /tmp/cgspace-ilri-slideshare-sorted-only-urls-sorted.csv /tmp/ilri-slideshare-sorted-sorted.csv | wc -l
542
```
- It's important to note that you must use GNU `sort` on the files first, as I had tried sorting in vim and it didn't satisfy `join`
- So it seems there are 542 Slideshare presentations we are missing
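- For reference, the pre-sorting step looks roughly like this (a sketch; the unsorted input filenames are illustrative), sorting on the first comma-separated field that `join` matches on:
```console
$ sort -t, -k1,1 /tmp/cgspace-ilri-slideshare-only-urls.csv > /tmp/cgspace-ilri-slideshare-sorted-only-urls-sorted.csv
$ sort -t, -k1,1 /tmp/ilri-slideshare.csv > /tmp/ilri-slideshare-sorted-sorted.csv
```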
## 2023-09-20
- Regarding the incorrect city in Solr statistics, I see we have 1,600,000 of them
- Before filing a GitHub issue, I want to check if they maybe come from an Atmire module, as I see them clustered around two particular CUA versions:
```json
{
"responseHeader": {
"status": 0,
"QTime": 2760,
"params": {
"q": "city:com.maxmind.geoip2.record.City*",
"facet.field": "cua_version",
"indent": "true",
"rows": "0",
"wt": "json",
"facet": "true",
"_": "1695192301927"
}
},
"response": {
"numFound": 1661863,
"start": 0,
"docs": []
},
"facet_counts": {
"facet_queries": {},
"facet_fields": {
"cua_version": [
"6.x-4.1.10-ilri-RC7",
1112186,
"6.x-4.1.10-ilri-RC5",
451180,
"6.x-4.1.10-ilri-RC9",
0
]
},
"facet_dates": {},
"facet_ranges": {},
"facet_intervals": {}
}
}
```
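- For reference, the facet counts above came from a query along these lines (a sketch of the equivalent curl request, reconstructed from the `params` echoed in the response; the Solr host and port are assumptions):
```console
$ curl -s -G 'http://localhost:8081/solr/statistics/select' \
    --data-urlencode 'q=city:com.maxmind.geoip2.record.City*' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=cua_version' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'wt=json'
```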
- I migrated AReS from Linode to Hetzner
- I asked on Slack and someone told me that we need to edit `src/app/menu.resolver.ts` to add new drop down menus to the top navbar
- It works, though it is unfortunate that we can't do it in a theme
## 2023-09-21
- More minor work on DSpace 7 home page and menus
- Meeting to discuss types and DSpace 7 migration plans
- Create a DSpace 7 theme for IITA
## 2023-09-22
- Create a DSpace 7 theme for IWMI
- I had some issues with pm2 on the new DSpace 7 Test
- It seems to be due to mixing systemd starting versus manually starting / stopping...
- After reading the discussion in [this pm2 issue](https://github.com/Unitech/pm2/issues/2914) I realize that we probably need to use `--no-daemon` to have systemd fully manage the processes without pm2 trying to save state
## 2023-09-23
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-09-25
- CGSpace metadata and community / collection cleanup
- Review some patches on DSpace Angular
- Create a basic Alliance theme for DSpace 7
## 2023-09-27
- I realized that we can get controlled vocabularies from DSpace 7's REST API, for both value-pairs and hierarchical controlled vocabularies, ie:
https://dspace7test.ilri.org/server/api/submission/vocabularies/common_iso_languages/entries
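- For example, a quick check from the command line (assuming `jq` is available for pretty-printing):
```console
$ curl -s 'https://dspace7test.ilri.org/server/api/submission/vocabularies/common_iso_languages/entries' | jq .
```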
## 2023-09-29
- Meeting with Aditi and others to discuss plan for using CGSpace to do a systematic review of CGIAR research on climate change
- I cleaned up metadata for a hundred or so items, and realized we will need to do more to make sure abstracts and open access status are correct since there will be a laser focus on the metadata
## 2023-09-30
- Export CGSpace to check for missing Initiative collection mappings
- Still working on checking Unpaywall for access rights and licenses for our DOIs
- Regarding Unpaywall's "evidence" metadata about whether an item is open access or not, after looking at dozens of items manually:
- evidence: "oa journal (via doaj)" <---- yes
- evidence: "open (via free article)" <---- hmmm, not always correct
- evidence: "open (via page says license)" <--- noooo, can't rely on that
- evidence: "open (via page says Open Access)" <---- yes...?
- evidence: "open (via free pdf)" <---- hmmm, not always correct
- evidence: "oa journal (via publisher name)" <---- noooo
- I updated access status for about four hundred more items based on this, and licenses for a dozen or so
<!-- vim: set sw=2 ts=2: -->

150
content/posts/2023-10.md Normal file
View File

@@ -0,0 +1,150 @@
---
title: "October, 2023"
date: 2023-10-02T09:05:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-10-02
- Export CGSpace to check DOIs against Crossref
- I found that [Crossref's metadata is in the public domain under the CC0 license](https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/)
- One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
- We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
<!--more-->
- This GREL extracts the _text_ content of the `<jats:p>` tags (ie, no other JATS XML markup tags like `<jats:i>`, `<jats:sub>`, etc):
```console
forEach(value.parseXml().select("jats|p"),i,i.xmlText()).join("")
```
- Note that we need to use `select("jats|p")` instead of `select("jats:p")` for OpenRefine's parseXml, and we need to `join()` on the end
- I updated metadata for about 3,000 items using Crossref metadata
- I stripped trailing periods for titles where they were missing on the Crossref titles
- I copied abstracts for about 600 items that were missing them, for items that were Creative Commons
- I updated publishers for a few thousand more where ours and Crossref disagreed, checking a handful manually first
- I also added subjects to the `crossref_doi_lookup.py` script to see if they will be useful for us
- When checking with csv-metadata-quality I can validate those subjects against AGROVOC and add them if they are valid
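- The validation step would be something like this (a sketch, assuming csv-metadata-quality's `-i`/`-o` and `--agrovoc-fields` options; filenames are illustrative):
```console
$ csv-metadata-quality -i /tmp/crossref-subjects.csv -o /tmp/crossref-subjects-checked.csv --agrovoc-fields dcterms.subject
```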
## 2023-10-03
- I added the item type to the collection subscription email on DSpace 6
- It's done differently on DSpace 7 so I'll have to see how to do it there...
- Test a patch that fixes a bug with item versioning disabled in DSpace 7
- I hadn't realized that DSpace 7 defaulted to versioning being enabled, whereas we never used this in DSpace 6 (yet)
- Submit [an issue regarding duplicate Discovery sort fields](https://github.com/DSpace/DSpace/issues/9104) in DSpace 7
## 2023-10-05
- Some discussion this week about issue and online dates for journal articles, with regards to PRMS
- I looked more closely at the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and realized (again) that their "issue" date is not the same as our issue date—they take the earlier of the print and online dates!
- Also, *very many* items have no print date at all, perhaps due to delays, errors, or simply because the journal is "online only"!
- I suggested again that PRMS should consider both, and take the earlier of the two, then make sure whether the date is in the current reporting period
- I managed to find 80 items with print publishing dates from 2023 and updated those from Crossref, but for the rest we will have to think about how we handle them
## 2023-10-06
- More discussion about dates after looking closely at them yesterday and today
- Crossref doesn't always have both issued and online dates—sometimes they have one, sometimes the other, and sometimes both, so we cannot rely on them 100% for that.
- In some cases, the item is available online for months (or even a year!), but has not been included in an issue yet, and thus has no "issue" date, for example:
- https://doi.org/10.1002/csc2.20914 <--- published online January 2023!
- https://doi.org/10.1111/mcn.13401 <--- published online July 2022!
- Even journals make mistakes: this journal article was "issued" in 2022, but online in 2023! This is not Crossref's fault, but the journal's!
- https://doi.org/10.1186/s40066-022-00400-6
- I found a bunch more strange cases regarding dates and recommended to PRMS team that they use the earlier of the issued and online dates
- Meet with Aditi to start discussing the scope of knowledge products we can get for the CGIAR climate change synthesis
## 2023-10-07
- I spent a few hours (!) debugging an issue in Python when downloading PDFs
- I think it ended up being due to `requests_cache`!!! Grrrr
- On a positive note I've greatly refactored my script for discovering and downloading PDFs from Unpaywall
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-10-08
- Starting to see some stuck locks on CGSpace this morning
- I will give notice and restart CGSpace
- Work on Python script to harvest DSpace REST API and save to CSV
## 2023-10-11
- File an issue on the DSpace issue tracker regarding the MaxMind JSON objects in our Solr statistics: https://github.com/DSpace/DSpace/issues/9118
## 2023-10-12
- Discuss MODS issues in CGSpace's OAI-PMH with Stefano and Valentina
- AGRIS can currently only support MODS 3.7 so they need us to roll our 3.8 work from 2023-06 back down, which requires some minor changes to the crosswalk
## 2023-10-13
- I did some more minor work to get the MODS 3.7 changes ready for AGRIS on DSpace Test
## 2023-10-14
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- I deployed the AGRIS changes for OAI-PMH on CGSpace
## 2023-10-16
- Fix some typos in ILRI subjects on CGSpace
- These were affecting the taxonomy on ilri.org
- I exported CGSpace and did some validation and cleanup on ILRI subjects, moving some to AGROVOC subjects
- Port the MODS 3.7 crosswalk from DSpace 6 to DSpace 7
- It works fine, we only need to take note that the OAI-PMH endpoint is now relative to the `/server` path instead of a dedicated OAI path
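- For example, a quick sanity check against the new path on DSpace 7 Test:
```console
$ curl -s 'https://dspace7test.ilri.org/server/oai/request?verb=Identify'
```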
## 2023-10-17
- Export CGSpace to do some cleanups all over on invalid metadata values
- I found many metadata values in the wrong field, wrong format, etc
- This ended up being cleanups for 694 items
## 2023-10-20
- Export CGSpace to check for missing Initiative collection mappings
- I also did a run of looking up all Initiative outputs with DOIs against Crossref to check for missing dates, publishers, etc
- I found issued dates for a few, and online dates for over 100
- I also fixed some incorrect licenses, access status, and abstracts
## 2023-10-23
- Export a list of Internal Documents for Peter to review to see if we can re-classify some
- Peter sent changes for 740 items so I applied them on CGSpace
- Testing the changes for OpenRXV DSpace 7 compatibility
## 2023-10-24
- Sync DSpace 7 Test with a fresh CGSpace snapshot
- Meeting with FARA to discuss DSpace training and support
- Meeting with IFPRI about migrating to CGSpace
## 2023-10-25
- Maria was asking about an error deleting an item in the Alliance community
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:..."
- According to my notes this error happened a few times in the past and is some kind of corner case regarding permissions
- I deleted the item for her
- I deleted a handful of old CRP groups on CGSpace
## 2023-10-27
- Peter sent me a list of journal articles from Altmetric that have an ILRI affiliation, but no Handle
- I used my `crossref_doi_lookup.py` script to fetch the metadata for them using their DOIs, then did a bunch of cleanup in OpenRefine
- Test some LDAP patches for DSpace 7
## 2023-10-30
- Some work on metadata for Aditi's review
- I found more preprints grrrr
## 2023-10-31
- Peter got back to me with the cleanups on ILRI journal articles from Altmetric that we didn't have on CGSpace
- I did another duplicate check and found four more duplicates that had been uploaded yesterday
- Then I did a quick sanity check and uploaded the remaining 19 items to CGSpace
<!-- vim: set sw=2 ts=2: -->

215
content/posts/2023-11.md Normal file
View File

@@ -0,0 +1,215 @@
---
title: "November, 2023"
date: 2023-11-02T12:59:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-11-01
- Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
- I improved the filtering and wrote some Python using pandas to merge my sources more reliably
## 2023-11-02
- Export CGSpace to check missing Initiative collection mappings
- Start a harvest on AReS
<!--more-->
- IFPRI contacted us about importing their Slideshare presentations to CGSpace
- There are ~1,700 of them and date back to as early as 2008
- I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test
## 2023-11-03
- A little bit of work on the CGIAR Climate Change Synthesis
- Discuss some CGSpace migration plans with Leigh from IFPRI
- For their Slideshare content we agreed:
- Exclude private
- Exclude deleted
- Exclude non presentation types
- Exclude duplicates within the collection for now until we can sort them out
- That leaves about 1,500 items out of the 1,700
- I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those
## 2023-11-04
- Export CGSpace to check for missing Initiative collection mappings
- I ran through the list of potential duplicates on the IFPRI Slideshare presentations
## 2023-11-05
- Work with Salem to migrate AReS to the new version
## 2023-11-07
- DSpace 7 Test went down and there is very high load on the server
- I saw very high load from Java but didn't have time to check exactly what was wrong so I just rebooted the host
- A few hours after restarting the system went down again, with very high load from Java again
- I see lots of messages like this in the Tomcat log:
```
tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
```
- I see some messages in `dspace.log` about heap space:
```
Caused by: java.lang.OutOfMemoryError: Java heap space
```
- I will increase Tomcat's heap from 4096m to 5120m
- A few hours later it happened again, so I increased the heap from 5120m to 6144m
- Not sure what's going on today...
- I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
```console
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
$ dspace index-discovery -r 10947/2516
$ dspace index-discovery -r 10947/2515
$ dspace index-discovery -r 10568/83389
$ dspace index-discovery
```
- I think this is the minimal we can do to avoid a full Discovery reindex which is very expensive
- I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode as I had done before in [September, 2023]({{< relref "2023-09.md" >}})
## 2023-11-08
- DSpace 7 Test has very high load again and I see more Java heap space errors in the log
```console
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07
35
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
7
```
- I don't know what is happening... I will increase the heap size from 6144m to 7168m again...
- I did some work on the value mappings in AReS
- I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine
- Importing duplicates the records, so I deleted and re-created the index in Elasticsearch first
- Then I started a new harvest on AReS to make sure the mappings are applied
## 2023-11-09
- Ryan asked me for help uploading a large PDF to CGSpace
- I tried my usual GhostScript prepress invocation and found the size decreased significantly, but some minor artifacts appeared in the images
- Interestingly, the [GhostScript docs](https://ghostscript.com/docs/9.54.0/VectorDevices.htm) mention that `prepress` doesn't give the best results:
> Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).
- Also, I found [a question on StackOverflow discussing some further techniques for PDFs with images](https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality):
```console
$ gs -sOutputFile=137166-default-dct.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -c "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams" -f 137166.pdf
```
- This looks much better, and is still much smaller than the original
- Also, I used `pdfimages` to extract all the images from the original and the one above and found:
```console
$ du -sh images-*
886M images-default-dct
1012M images-original
```
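- The extraction itself was along these lines (a sketch using poppler's `pdfimages -all`; the image-root prefixes are arbitrary):
```console
$ mkdir images-original images-default-dct
$ pdfimages -all 137166.pdf images-original/img
$ pdfimages -all 137166-default-dct.pdf images-default-dct/img
```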
- And from [WeCompress's analysis](https://www.wecompress.com/en/analyze) I see that the images are 85% of the size of the PDF
## 2023-11-10
- I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace
## 2023-11-11
- Salem fixed a bug on OpenRXV that was splitting country values by "," before matching them with ISO countries
- I exported CGSpace to check for missing Initiative collection mappings
- Start a fresh harvest on AReS
## 2023-11-16
- Discuss mapping ICARDA outputs from Initiatives to ICARDA collections on CGSpace
- I added MEL's CGSpace user to the administrator group of a handful of collections
- I also did a batch mapping of 274 existing Initiative outputs from ICARDA to the relevant collections
## 2023-11-18
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-11-22
- I was checking out the [DSpace 7 statistics](https://github.com/DSpace/RestContract/blob/main/statistics-reports.md) again and found that we have total visits and total downloads for each DSpace object, for example [this item](https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748):
- TotalVisits: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits
- TotalDownloads: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads
- And the numbers match those in my dspace-statistics-api *exactly*!
- This can be useful to get an individual DSpace object's stats, but there is no way to iterate over all objects like all items...
- We can look at using this to draw stats on the community, collection, and item pages
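- For example, fetching the TotalVisits report for that item from the command line (assuming `jq` for pretty-printing; depending on configuration this endpoint may require authentication):
```console
$ curl -s 'https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits' | jq .
```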
## 2023-11-23
- Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
```console
localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
count
───────
47818
(1 row)
```
- It's been some time since I looked at our Solr statistics to find new bots
- I found a few new ones that I [submitted to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/60) and added to our local bot list:
- GuzzleHttp/7
- Owler@ows.eu/1
- newspaperjs
- I ran my old `check-spider-hits.sh` script with a list of bots from our local overrides to purge hits from Solr:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 30 hits from ubermetrics in statistics
Purging 59 hits from curb in statistics
Purging 36 hits from bitdiscovery in statistics
Purging 87 hits from omgili in statistics
Purging 47 hits from Vizzit in statistics
Purging 109 hits from Java\/17-ea in statistics
Purging 40 hits from AdobeUxTechC4-Async in statistics
Purging 21 hits from ZaloPC-win32-24v473 in statistics
Purging 21 hits from nbertaupete95 in statistics
Purging 52 hits from Scoop\.it in statistics
Purging 16 hits from WebAPIClient in statistics
Purging 241 hits from RStudio in statistics
Purging 1255 hits from ^MEL in statistics
Purging 47850 hits from GuzzleHttp in statistics
Purging 8714 hits from Owler in statistics
Purging 1083 hits from newspaperjs in statistics
Purging 369 hits from ^Chrome$ in statistics
Purging 1474 hits from curl in statistics
Total number of bot hits purged: 61504
```
- I also noticed 35,000 requests over the past few years from lowercase user agents, which is [definitely weird](https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case), for example:
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
- I'm gonna add those to our overrides and purge them:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 35816 hits from ^mozilla in statistics
Total number of bot hits purged: 35816
```
## 2023-11-30
- Minor updates to our OAI MODS crosswalk
- Stefano found a minor markup issue with our alternative titles (`<titleInfo>` tag)
- Very high load on CGSpace since after lunch
- I killed some locks that had been stuck for a few hours
<!-- vim: set sw=2 ts=2: -->

271
content/posts/2023-12.md Normal file
View File

@@ -0,0 +1,271 @@
---
title: "December, 2023"
date: 2023-12-01T08:48:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-12-01
- There is still high load on CGSpace and I don't know why
- I don't see a high number of sessions compared to previous days in the last few weeks
<!--more-->
```console
$ for file in dspace.log.2023-11-[23]*; do echo "$file"; grep -a -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
dspace.log.2023-11-20
22865
dspace.log.2023-11-21
20296
dspace.log.2023-11-22
19688
dspace.log.2023-11-23
17906
dspace.log.2023-11-24
18453
dspace.log.2023-11-25
17513
dspace.log.2023-11-26
19037
dspace.log.2023-11-27
21103
dspace.log.2023-11-28
23023
dspace.log.2023-11-29
23545
dspace.log.2023-11-30
21298
```
- Even the number of unique IPs is not very high compared to the last week or so:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq | wc -l
17023
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.2.gz | sort | uniq | wc -l
17294
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.3.gz | sort | uniq | wc -l
22057
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.4.gz | sort | uniq | wc -l
32956
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.5.gz | sort | uniq | wc -l
11415
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.6.gz | sort | uniq | wc -l
15444
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.7.gz | sort | uniq | wc -l
12648
```
- It doesn't make any sense so I think I'm going to restart the server...
- After restarting the server the load went down to normal levels... who knows...
- I started trying to see how I'm going to generate the fake statistics for the Alliance bitstream that was replaced
- I exported all the statistics for the owningItem now:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-export.json -f 'owningItem:b5862bfa-9799-4167-b1cf-76f0f4ea1e18' -k uid
```
- Importing them into DSpace Test didn't show the statistics in the Atmire module, but I see them in Solr...
## 2023-12-02
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-12-04
- Send a message to Altmetric support because the item IWMI highlighted last month still doesn't show the attention score for the Handle after I tweeted it several times weeks ago
- Spent some time writing a Python script to fix the literal MaxMind City JSON objects in our Solr statistics
- There are about 1.6 million of these, so I exported them using solr-import-export-json with the query `city:com*` but ended up finding many that have missing bundles, container bitstreams, etc:
```
city:com* AND -bundleName:[* TO *] AND -containerBitstream:[* TO *] AND -file_id:[* TO *] AND -owningItem:[* TO *] AND -version_id:[* TO *]
```
- (Note the negation to find fields that are missing)
- I don't know what I want to do with these yet
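- The export itself followed the same solr-import-export-json invocation I used above, roughly (a sketch; the output path is illustrative):
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-maxmind-city.json -f 'city:com*' -k uid
```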
## 2023-12-05
- I finished the `fix_maxmind_stats.py` script and fixed 1.6 million records and imported them on CGSpace after testing on DSpace 7 Test
- Altmetric said there was a glitch regarding the Handle and DOI linking and they successfully re-scraped the item page and linked them
- They sent me a list of current production IPs and I notice that some of them are in our nginx bot network list:
```console
$ for network in $(csvcut -c network /tmp/ips.csv | sed 1d | sort -u); do grepcidr $network ~/src/git/rmg-ansible-public/roles/dspace/files/nginx/bot-networks.conf; done
108.128.0.0/13 'bot';
46.137.0.0/16 'bot';
52.208.0.0/13 'bot';
52.48.0.0/13 'bot';
54.194.0.0/15 'bot';
54.216.0.0/14 'bot';
54.220.0.0/15 'bot';
54.228.0.0/15 'bot';
63.32.242.35/32 'bot';
63.32.0.0/14 'bot';
99.80.0.0/15 'bot'
```
- I will remove those for now so that Altmetric doesn't have any unexpected issues harvesting
## 2023-12-08
- Finalized the script to generate Solr statistics for Alliance researcher Mirjam
- The script is `ilri/generate_solr_statistics.py`
- I generated ~3,200 statistics based on her records of the download statistics of [that item](https://hdl.handle.net/10568/131997) and imported them on CGSpace
- Did some work on the DSpace 7 submission form
- Peter asked for lists of affiliations, investors, and publishers to do some cleanups
- I generated a list from a CSV export instead of doing it based on a SQL dump...
```console
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -hr \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-12-08-initiatives-affiliations.csv
```
- Export a list of authors as well:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 3 GROUP BY "dc.contributor.author" ORDER BY count DESC) to /tmp/2023-12-08-authors.csv WITH CSV HEADER;
COPY 102435
```
## 2023-12-11
- Work on OpenRXV dependencies and podman a bit
- Peter noticed that the statistics for this month are very very low on CGSpace
- I don't know what is going on, perhaps it is related to me adjusting the nginx config last week?
- Ah, it's probably because of the spider patterns I updated in 2023-11
## 2023-12-16
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-12-17
- Pull latest master branch for OpenRXV and deploy on the server
- I threw away some changes in the tree regarding the Angular base ref, and it broke AReS
- So note to self: we need to set the base ref in `frontend/Dockerfile` before building!
- Now Salem fixed the country map
## 2023-12-18
- Work a bit on the IFPRI-ISNAR archive from Leigh
- More work on the DSpace 7 home page
## 2023-12-19
- More work on the DSpace 7 home page
- The Alliance TIP team is testing deposits to the DSpace 7 REST API and getting an HTTP 500 error
- In the DSpace logs I see this after they log in, create the item, and update the metadata:
```
2023-12-19 17:49:28,022 ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
```
- I found some messages on the dspace-tech mailing list suggesting this might be an old bug: https://groups.google.com/g/dspace-tech/c/My1GUFYFGoU/m/tS7-WAJPAwAJ
- I restarted Tomcat and told the Alliance TIP team to try again
## 2023-12-20
- The Alliance guys said that submitting via REST works now... sigh, so that's just some old DSpace 5/6 REST API bug
- I lowercased all our AGROVOC keywords in `dcterms.subject` in SQL:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 462
dspace=*# COMMIT;
COMMIT
```
## 2023-12-25
- Looking into [Solr backups](https://solr.apache.org/guide/8_11/making-and-restoring-backups.html)
- Since we are not running in Solr Cloud mode we need to use the replication endpoint for Solr standalone
- This works:
```console
$ curl 'http://localhost:8983/solr/statistics/replication?command=backup'
{
"responseHeader":{
"status":0,
"QTime":26},
"status":"OK"}
```
- Then I saw the size of the snapshot reach the size of the index...
```console
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
16G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
20G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
21G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
22G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
```
- Then I deleted the core and restored from the snapshot backup:
```console
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<commit />'
$ curl 'http://localhost:8983/solr/statistics/replication?command=restore&name=statistics'
```
- Interestingly the import worked fine, but created a new data index:
```console
# du -sh /var/solr/data/configsets/statistics/data/*
4.0K /var/solr/data/configsets/statistics/data/index.properties
22G /var/solr/data/configsets/statistics/data/restore.20231225154626463
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
22G /var/solr/data/configsets/statistics/data/snapshot.statistics
```
- Not sure the implications of that—Solr uses the data just fine
- I can surely use this for atomic Solr backups
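- To check on a backup's progress, the replication handler's `details` command seems handy (a sketch, assuming `jq`):
```console
$ curl -s 'http://localhost:8983/solr/statistics/replication?command=details' | jq .
```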
## 2023-12-27
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Do some other metadata cleanups on CGSpace
- I also looked up our DOIs on Crossref to get some missing abstracts and correct licenses and dates
- Some minor work on the CGSpace DSpace 7 theme to fix the navbar on mobile
- Some work on the IFPRI ISNAR archive
## 2023-12-28
- I started porting the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to DSpace 7
- Some work on the IFPRI ISNAR archive
- I ended up going through most of the PDFs to get better dates and abstracts
## 2023-12-29
- I created a new Hetzner server to replace the current DSpace 6 CGSpace next week when we migrate to DSpace 7
- Interesting, I haven't checked for content pointing to legacy domains in several years (!)
- `inurl:mahider.cgiar.org`: 0 results on Google!
- `inurl:mahider.ilri.org`: 2,100 results on Google
- `inurl:mahider.ilri.org inurl:https`: 2 results on Google (!)
- `inurl:dspace.ilri.org`: 1,390 results on Google
- `inurl:dspace.ilri.org inurl:https`: 0 results on Google (!)
- So it seems I can do away with the HTTPS virtual hosts finally
- Well my current certificates expired on 2021-02-13 and nobody noticed... so...
<!-- vim: set sw=2 ts=2: -->

430
content/posts/2024-01.md Normal file
View File

@@ -0,0 +1,430 @@
---
title: "January, 2024"
date: 2024-01-02T10:08:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-01-02
- Work on preparation of new server for DSpace 7 migration
- I'm not quite sure what we need to do for the Handle server
- For now I just ran the `dspace make-handle-config` script and diffed it with the one from DSpace 6
- I sent the bundle to the Handle admins to make sure it's OK before we do the migration
- Continue testing and debugging the cgspace-java-helpers on DSpace 7
- Work on IFPRI ISNAR archive cleanup
<!--more-->
## 2024-01-03
- I haven't heard from the Handle admins so I'm preparing a backup solution using nginx streams
- This seems to work in my simple tests (this must be outside the `http {}` block):
```
stream {
upstream handle_tcp_9000 {
server 188.34.177.10:9000;
}
server {
listen 9000;
proxy_connect_timeout 1s;
proxy_timeout 3s;
proxy_pass handle_tcp_9000;
}
}
```
- Here I forwarded a test TCP port 9000 from one server to another and was able to retrieve a test HTML that was running on the target
- I will have to do TCP and UDP on port 2641, and TCP/HTTP on port 8000.
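- A rough sketch of how that could extend the test config above (untested; `proxy_responses 1` assumes a single reply datagram per Handle query):
```
stream {
    upstream handle_server {
        server 188.34.177.10:2641;
    }
    upstream handle_http {
        server 188.34.177.10:8000;
    }
    server {
        listen 2641;
        proxy_pass handle_server;
    }
    server {
        listen 2641 udp;
        proxy_responses 1;
        proxy_pass handle_server;
    }
    server {
        listen 8000;
        proxy_pass handle_http;
    }
}
```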
- I did some more minor work on the IFPRI ISNAR archive
- I got some PDFs from the UMN AgEcon search and fixed some metadata
- Then I did some duplicate checking and found five items already on CGSpace
## 2024-01-04
- Upload 692 items for the ISNAR archive to CGSpace: https://cgspace.cgiar.org/handle/10568/136192
- Help Peter proof and upload 252 items from the 2023 Gender conference to CGSpace
- Meeting with IFPRI to discuss their migration to CGSpace
- We agreed to add two new fields, one for IFPRI project and one for IFPRI publication ranking
- Most likely we will use `cg.identifier.project` as a general field and consolidate other project fields there
- Not sure which field to use for the publication rank...
## 2024-01-05
- Proof and upload 51 items in bulk for IFPRI
- I did a big cleanup of user groups in anticipation of complaints about slow workflow tasks etc in DSpace 7
- I removed ILRI editors from all the dozens of CCAFS community and collection groups, and I should do the same for other CRPs since they are closed for two years now
## 2024-01-06
- Migrate CGSpace to DSpace 7
## 2024-01-07
- High load on the server and UptimeRobot saying the frontend is flapping
- I noticed tons of logs from pm2 in the systemd journal, so I disabled those in the systemd unit because they are available from pm2's log directory anyway
- I also noticed the same for Solr, so I disabled stdout for that systemd unit as well
- I spent a lot of time bringing back the nginx rate limits we used in DSpace 6 and it seems to have helped
- I see some client doing weird HEAD requests to search pages:
```
47.76.35.19 - - [07/Jan/2024:00:00:02 +0100] "HEAD /search/?f.accessRights=Open+Access%2Cequals&f.actionArea=Resilient+Agrifood+Systems%2Cequals&f.author=Burkart%2C+Stefan%2Cequals&f.country=Kenya%2Cequals&f.impactArea=Climate+adaptation+and+mitigation%2Cequals&f.itemtype=Brief%2Cequals&f.publisher=CGIAR+System+Organization%2Cequals&f.region=Asia%2Cequals&f.sdg=SDG+12+-+Responsible+consumption+and+production%2Cequals&f.sponsorship=CGIAR+Trust+Fund%2Cequals&f.subject=environmental+factors%2Cequals&spc.page=1 HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.2504.63 Safari/537.36"
```
- I will add their network blocks (AS45102) and regenerate my list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS21859
$ cat AS* | sort | uniq | wc -l
4897
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
2017 /tmp/networks.txt
```
- I'm surprised to see the number of networks reduced from my current ones... hmmm.
- I will also update my list of Bing networks:
```console
$ ./ilri/bing-networks-to-ips.sh
$ ~/go/bin/mapcidr -a < /tmp/bing-ips.txt > /tmp/bing-networks.txt
$ wc -l /tmp/bing-networks.txt
250 /tmp/bing-networks.txt
```
## 2024-01-08
- Export list of publishers for Peter to select some amount to use as a controlled vocabulary:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.publisher", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 178 GROUP BY "dcterms.publisher" ORDER BY count DESC) to /tmp/2024-01-publishers.csv WITH CSV HEADER;
COPY 4332
```
- Address some feedback on DSpace 7 from users, including filing some issues on GitHub
- https://github.com/DSpace/dspace-angular/issues/2730: List of available metadata fields is truncated when adding new metadata in "Edit Item"
- The Alliance TIP team was having issues posting to one collection via the legacy DSpace 6 REST API
- In the DSpace logs I see the same issue that they had last month:
```
ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
```
## 2024-01-09
- I restarted Tomcat to see if it helps the REST issue
- After talking with Peter about publishers we decided to get a clean list of the top ~100 publishers and then make sure all CGIAR centers, Initiatives, and Impact Platforms are there as well
- I exported a list from PostgreSQL and then filtered by count > 40 in OpenRefine and then extracted the metadata values:
```
$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
```
- Export a list of ORCID identifiers from PostgreSQL to look them up on ORCID and update our controlled vocabulary:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
localhost/dspace7= ☘ \q
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-09-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
```
- Then I updated existing ORCID identifiers in CGSpace:
```
$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
```
- Bizu seems to be having issues due to belonging to too many groups
- I see some messages from Solr in the DSpace log:
```
2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
```
- There are 1,805 OR clauses in the full log!
- We previously had this issue in 2020-01 and 2020-02 with DSpace 5 and DSpace 6
- At the time the solution was to increase the `maxBooleanClauses` in Solr and to disable access rights awareness, but I don't think we want to do the second one now
- I saw many users of Solr in other applications increasing this to obscenely high numbers, so I think we should be OK to increase it from 1024 to 2048
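- As a sketch, the setting lives in the core's `solrconfig.xml` under `<query>` (the exact path depends on the Solr install, so treat it as an assumption):
```
<!-- solrconfig.xml, inside the <query> section -->
<maxBooleanClauses>2048</maxBooleanClauses>
```
- Newer Solr versions also have a global cap in `solr.xml` that may need raising to match, and Solr needs a restart or core reload afterwards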
- Re-visiting the DSpace user groomer to delete inactive users
- In 2023-08 I noticed that this was now [possible in DSpace 7](https://github.com/DSpace/DSpace/pull/2928)
- As a test I tried to delete all users who have been inactive since six years ago (January 9, 2018):
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
```
- I tested it on DSpace 7 Test and it worked... I am debating running it on CGSpace...
- I see we have almost 9,000 users:
```console
$ dspace user -L > /tmp/users-before.txt
$ wc -l /tmp/users-before.txt
8943 /tmp/users-before.txt
```
- I decided to do the same on CGSpace and it worked without errors
- I finished working on the controlled vocabulary for publishers
## 2024-01-10
- I spent some time deleting old groups on CGSpace
- I looked into the use of the `cg.identifier.ciatproject` field and found there are only a handful of uses, with some even seeming to be a mistake:
```console
localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
cg.identifier.ciatproject │ count
───────────────────────────┼───────
D145 │ 4
LAM_LivestockPlus │ 2
A215 │ 1
A217 │ 1
A220 │ 1
A223 │ 1
A224 │ 1
A227 │ 1
A229 │ 1
A230 │ 1
CLIMATE CHANGE MITIGATION │ 1
LIVESTOCK │ 1
(12 rows)
Time: 240.041 ms
```
- I think we can move those to a new `cg.identifier.project` if we create one
- The `cg.identifier.cpwfproject` field is similarly sparse, but the CCAFS ones are widely used
## 2024-01-12
- Export a list of affiliations to do some cleanup:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
```
- I first did some clustering and editing in OpenRefine; then I will import those back into CGSpace and do another export
- Troubleshooting the statistics pages that aren't working on DSpace 7
- On a hunch, I queried for Solr statistics documents that **did not have an `id` matching the 36-character UUID pattern**:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"-id:/.{36}/",
"rows":"0"}},
"response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
}}
```
- They seem to come mostly from 2020, 2023, and 2024:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
"responseHeader":{
"status":0,
"QTime":13,
"params":{
"facet.range":"time",
"q":"-id:/.{36}/",
"facet.range.gap":"+1YEAR",
"rows":"0",
"facet":"true",
"facet.range.start":"2010-01-01T00:00:00Z",
"facet.range.end":"NOW"}},
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"time":{
"counts":[
"2010-01-01T00:00:00Z",0,
"2011-01-01T00:00:00Z",0,
"2012-01-01T00:00:00Z",0,
"2013-01-01T00:00:00Z",0,
"2014-01-01T00:00:00Z",0,
"2015-01-01T00:00:00Z",89,
"2016-01-01T00:00:00Z",11,
"2017-01-01T00:00:00Z",0,
"2018-01-01T00:00:00Z",0,
"2019-01-01T00:00:00Z",0,
"2020-01-01T00:00:00Z",1339,
"2021-01-01T00:00:00Z",0,
"2022-01-01T00:00:00Z",0,
"2023-01-01T00:00:00Z",653736,
"2024-01-01T00:00:00Z",144993],
"gap":"+1YEAR",
"start":"2010-01-01T00:00:00Z",
"end":"2025-01-01T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
```
- They seem to come from 2023-08 until now (so way before we migrated to DSpace 7):
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
"responseHeader":{
"status":0,
"QTime":196,
"params":{
"facet.range":"time",
"q":"-id:/.{36}/",
"facet.range.gap":"+1MONTH",
"rows":"0",
"facet":"true",
"facet.range.start":"2023-01-01T00:00:00Z",
"facet.range.end":"NOW"}},
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"time":{
"counts":[
"2023-01-01T00:00:00Z",1,
"2023-02-01T00:00:00Z",0,
"2023-03-01T00:00:00Z",0,
"2023-04-01T00:00:00Z",0,
"2023-05-01T00:00:00Z",0,
"2023-06-01T00:00:00Z",0,
"2023-07-01T00:00:00Z",0,
"2023-08-01T00:00:00Z",27621,
"2023-09-01T00:00:00Z",59165,
"2023-10-01T00:00:00Z",115338,
"2023-11-01T00:00:00Z",96147,
"2023-12-01T00:00:00Z",355464,
"2024-01-01T00:00:00Z",125429],
"gap":"+1MONTH",
"start":"2023-01-01T00:00:00Z",
"end":"2024-02-01T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
```
- I see that we had 31,744 statistics events yesterday, and 799 of them have no `id`!
- I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
- Several people said they have them, so it's a bug of some sort in DSpace, not our configuration
## 2024-01-13
- Yesterday alone we had 37,000 unique IPs making requests to nginx
- I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14
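- As a quick check, counting how many of the day's unique IPs fall inside that block is a one-liner (a sketch using `grepcidr`, which is an assumption on my part):
```console
# zcat -f /var/log/nginx/*access.log | awk '{print $1}' | sort -u | grepcidr 47.128.0.0/14 | wc -l
```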
## 2024-01-15
- Investigating the CSS selector warning that I've seen in PM2 logs:
```console
0|dspace-ui | 1 rules skipped due to selector errors:
0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
```
- It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is not invalid
- But that led me to a more interesting issue with `inlineCritical` optimization for styles in Angular SSR that might be responsible for causing high load in the frontend
- See: https://github.com/angular/angular/issues/42098
- See: https://github.com/angular/universal/issues/2106
- See: https://github.com/GoogleChromeLabs/critters/issues/78
- Since the production site was flapping a lot I decided to try disabling inlineCriticalCss
- There have been on and off load issues with the Angular frontend today
- I think I will just block all data center network blocks for now
- In the last week I see almost 200,000 unique IPs:
```console
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u | tee /tmp/ips.txt | wc -l
196493
```
- Looking these IPs up I see 18,000 coming from Comcast, 10,000 from AT&T, 4,110 from Charter, 3,500 from Cox, and dozens of other residential ISPs
- I highly doubt these are home users browsing CGSpace... seems super fishy
- Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
- I will temporarily add a few new datacenter ISP network blocks to our rate limit:
- 16509 Amazon-02
- 701 UUNET
- 8075 Microsoft
- 15169 Google
- 14618 Amazon-AES
- 396982 Google Cloud
- The load on the server *immediately* dropped
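- For reference, the shape of the rate limit is roughly this (a sketch; the variable and zone names are made up, and the real list of networks is much longer):
```console
geo $bot_networks {
    default       0;
    47.128.0.0/14 1;
}

map $bot_networks $bot_network_limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $bot_network_limit_key zone=bot_networks:10m rate=1r/s;
```
- Requests from the listed networks get a non-empty key and are throttled by `limit_req`; everyone else has an empty key and is not limited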
## 2024-01-17
- It turns out AS701 (UUNET) is Verizon Business, which is used as an ISP for many staff at IFPRI
- This was causing them to see HTTP 429 "too many requests" errors on CGSpace
- I removed this ASN from the rate limiting
## 2024-01-18
- Start looking at Solr stats again
- I found one statistics record that has 22,000 of the same collection in `owningColl` and 22,000 of the same community in `owningComm`
- The record is from 2015 and I think it would be easier to delete it than fix it:
```console
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>uid:3b4eefba-a302-4172-a286-dcb25d70129e</query></delete>'
```
- Looking again, there are at least 1,000 of these so I will need to come up with an actual solution to fix these
- I'm noticing we have 1,800+ links to defunct resources on bioversityinternational.org in the `cg.link.permalink` field
- I should ask Alliance if they have any plans to fix those, or upload them to CGSpace
## 2024-01-22
- Meeting with IWMI about ORCID integration on CGSpace now that we've migrated to DSpace 7
- File an issue for the inaccurate DSpace statistics: https://github.com/DSpace/DSpace/issues/9275
## 2024-01-23
- Meeting with IWMI about ORCID integration and the DSpace API for use with WordPress
- IFPRI sent me a list of their author ORCIDs to add to our controlled vocabulary
- I joined them with our current list and resolved their names on ORCID and updated them in our database:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI\ ORCiD\ All.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-23-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
```
- This adds about 400 new identifiers to the controlled vocabulary
- I consolidated our various project identifier fields for closed programs into one `cg.identifier.project`:
- `cg.identifier.ccafsproject`
- `cg.identifier.ccafsprojectpii`
- `cg.identifier.ciatproject`
- `cg.identifier.cpwfproject`
- I prefixed the existing 2,644 metadata values with "CCAFS", "CIAT", or "CPWF" so we can figure out where they came from if need be, and deleted the old fields from the metadata registry
## 2024-01-26
- Minor work on dspace-angular to clean up component styles
- Add `cg.identifier.publicationRank` to CGSpace metadata registry and submission form
## 2024-01-29
- Rework the nginx bot and network limits slightly to remove some old patterns/networks and remove Google
- The Google Scholar team contacted me to ask why their requests were timing out (well...)
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-02.md Normal file

@@ -0,0 +1,118 @@
---
title: "February, 2024"
date: 2024-02-05T11:10:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-02-05
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Lower case all the AGROVOC subjects on CGSpace
<!--more-->
```sql
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 180
dspace=*# COMMIT;
COMMIT
```
## 2024-02-06
- Discuss IWMI using the CGSpace REST API for their new website
- Export the IWMI community to extract their ORCID identifiers:
```console
$ dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
$ csvcut -c 'cg.creator.identifier,cg.creator.identifier[en_US]' ~/Downloads/2024-02-06-iwmi.csv \
| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
| sort -u \
| tee /tmp/iwmi-orcids.txt \
| wc -l
353
$ ./ilri/resolve_orcids.py -i /tmp/iwmi-orcids.txt -o /tmp/iwmi-orcids-names.csv -d
```
- I noticed some similar looking names in our list so I clustered them in OpenRefine and manually checked a dozen or so to update our list
## 2024-02-07
- Maria asked me about the "missing" item from last week again
- I can see it when I used the Admin search, but not in her workflow
- It was submitted by TIP so I checked that user's workspace and found it there
- After depositing, it went into the workflow so Maria should be able to see it now
## 2024-02-09
- Minor edits to CGSpace submission form
- Upload 55 ISNAR book chapters to CGSpace from Peter
## 2024-02-19
- Looking into the collection mapping issue on CGSpace
- It seems to be by design in DSpace 7: https://github.com/DSpace/dspace-angular/issues/1203
- This is a massive setback for us...
## 2024-02-20
- Minor work on OpenRXV to fix a bug in the ng-select drop downs
- Minor work on the DSpace 7 nginx configuration to allow requesting robots.txt and sitemaps without hitting rate limits
## 2024-02-21
- Minor updates on OpenRXV, including one bug fix for missing mapped collections
- Salem had to re-work the harvester for DSpace 7 since the mapped collections and parent collection list are separate!
## 2024-02-22
- Discuss tagging of datasets and re-work the submission form to encourage use of DOI field for any item that has a DOI, and the normal URL field if not
- The "cg.identifier.dataurl" field will be used for "related" datasets
- I still have to check and move some metadata for existing datasets
## 2024-02-23
- This morning Tomcat died due to an OOM kill from the kernel:
```console
kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 pgtables:20436kB oom_score_adj:0
```
- I don't see any abnormal pattern in my Grafana graphs, for JVM or system load... very weird
- I updated the submission form on CGSpace to include the new changes to URLs for datasets
- I also updated about 80 datasets to move the URLs to the correct field
## 2024-02-25
- This morning Tomcat died while I was doing a CSV export, with an OOM kill from the kernel:
```console
kernel: Out of memory: Killed process 720768 (java) total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0
```
- I don't know why this is happening so often recently...
## 2024-02-27
- IFPRI sent me a list of authors to add to our list for now, until we can find a better way of doing it
- I extracted the existing authors from our controlled vocabulary and combined them with IFPRI's:
```console
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/dc-contributor-author.xml \
| grep -oE 'label=".*"' \
| sed -e 's/label="//' -e 's/"$//' > /tmp/authors
$ cat /tmp/authors /tmp/ifpri-authors | sort -u > /tmp/new-authors
```
## 2024-02-28
- I figured out a way to add a new Angular component to handle all our relation fields
## 2024-02-29
- Clean up a bunch of metadata on CGSpace
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-03.md Normal file

@@ -0,0 +1,207 @@
---
title: "March, 2024"
date: 2024-03-01T09:55:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-03-01
- Last week Bizu reported an issue with the "browse by issue date" drop down
- I verified it, and suspect it could be due to missing issue dates...
- It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
<!--more-->
- I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable
- I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846
## 2024-03-03
- I did some cleanups on abstracts, licenses, and dates from CrossRef
- I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list
## 2024-03-05
- I tried a new technique to get some affiliations from Crossref using OpenRefine
- First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
- Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
- Then I joined them with our affiliations, paying no attention to duplicates
- Then I deduped them using the Jython technique I learned in 2023-02
## 2024-03-06
- Peter sent me some more corrections for the authors that I had sent him in 2023-12
## 2024-03-08
- IFPRI sent me their 2023 records from CONTENTdm so I started working on those
- I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:
```python
import re
with open(r"/tmp/cg-creator-identifier.txt", 'r') as f:
    orcid_ids = [orcid_id.strip() for orcid_id in f]
matched = False
for orcid_id in orcid_ids:
    if re.search(r'.+: {}'.format(value), orcid_id):
        matched = True
        break
if matched:
    return orcid_id
else:
    return value
```
- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:
```sql
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');
```
- Note the use of two single quotes to escape the one in the name
## 2024-03-11
- Experimenting with moving some of my Python scripts to the DSpace 7 REST API
- I need a way to get UUIDs for Handles...
- Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864
- Then just take the first result...?
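- For example, something like this should print the UUID of the first hit (a sketch, assuming `jq` and the usual shape of the discovery response):
```console
$ curl -s 'https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864' \
    | jq -r '._embedded.searchResult._embedded.objects[0]._embedded.indexableObject.uuid'
```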
- I spent some time working on the script to get abstracts from CGSpace, and found a bug in my logic
- I also noticed that one item had two abstracts, but the first one was blank!
- Looking deeper, I found 113 blank metadata values so I deleted those:
```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
COMMIT;
```
- I also found a few dozen items with "N/A" for their citation, so I deleted those too:
```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;
COMMIT;
```
- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:
```console
$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \
| csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv
```
## 2024-03-12
- Run the duplicate checker for IFPRI 2023 batch upload
## 2024-03-13
- I found about 428 duplicates in the IFPRI 2023 batch records
- Alarmingly, I found about 18 that are duplicated on CGSpace as well!
- I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
- Alliance asked me to send them the Handles for items submitted by TIP that are not discoverable
- I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:
```sql
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
```
## 2024-03-14
- Looking into reports of rate limiting of Altmetric's bot on CGSpace
- I don't see any HTTP 429 responses for their user agents in any of our logs...
- I tried myself on an item page and never hit a limit...
```console
$ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done
Request 1: 200
Request 2: 200
Request 3: 200
Request 4: 200
...
Request 60: 200
```
- All responses were HTTP 200...
- In any case, I whitelisted their production IPs and told them to try again
- I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace
- I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace
- This was a bit of dirty work using csvkit, xsv, and OpenRefine
## 2024-03-17
- There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace
- These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration
- I looked closer and whittled this down to 14 actual records, and spent some time working on them
- I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links
- Now there only remain two confusing records about the Inkomati catchment
## 2024-03-18
- Checking to see how many IFPRI records we have migrated so far:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \
| tee /tmp/ifpri-records.csv \
| csvstat --count
898
```
- I finalized the remaining two on Inkomati catchment and now we are at 900!
## 2024-03-19
- IWMI sent me some new author ORCID identifiers so I updated our list
- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC
- First extracting all unique subjects on CGSpace:
```
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;
COPY 28024
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv
```
## 2024-03-20
- Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace
## 2024-03-21
- Look more closely at duplicates on CGSpace based on a fresh export
- Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates
- I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original
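- The rough check itself is simple enough with csvkit against a full export (a sketch):
```console
$ csvcut -c 'cg.identifier.doi[en_US]' cgspace.csv \
    | csvgrep -c 'cg.identifier.doi[en_US]' -r '^.+$' \
    | sed '1d' | sort | uniq -d > /tmp/duplicate-dois.txt
```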
## 2024-03-22
- Look at duplicate DOIs on CGSpace and address a dozen or so
## 2024-03-23
- Look at duplicate DOIs on CGSpace and address a dozen or so
- Update Tomcat and Solr to latest versions
- I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked
## 2024-03-24
- Slowly process several dozen more duplicate DOIs on CGSpace, sigh...
## 2024-03-26
- File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880
- Merge metadata and withdraw more duplicates on CGSpace
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-04.md Normal file

@@ -0,0 +1,169 @@
---
title: "April, 2024"
date: 2024-04-04T10:23:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-04-04
- Work on CGSpace duplicate DOIs more
<!--more-->
## 2024-04-08
- Start working on IFPRI's 2022 batch import
- I ran the duplicate checker against CGSpace and started downloading all linked PDFs
## 2024-04-09
- Continue working on IFPRI's 2022 batch import
- I started validating the potential duplicates in OpenRefine
## 2024-04-12
- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then uploaded them
- I need to merge the metadata for the remaining 212 that are already on CGSpace
- Spend some time looking at duplicate DOIs again...
## 2024-04-13
- Spend some time looking at duplicate DOIs again...
## 2024-04-14
- Spend some time looking at duplicate DOIs again...
## 2024-04-15
- Spend some time looking at duplicate DOIs again...
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
- Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:
```
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
```
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
- I merged metadata with the existing items
- There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself
## 2024-04-16
- Spend some time looking at duplicate DOIs again...
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:
```
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
```
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
- I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
- It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!
## 2024-04-18
- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:
```sql
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
```
- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:
```sql
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
```
- Then I can work that list into an nginx map with redirect, for example:
```console
server {
    ...
    if ($new_uri) {
        return 301 $new_uri;
    }
}

map $request_uri $new_uri {
    /handle/10568/112821 /handle/10568/97605;
}
```
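- Generating the map entries from a two-column CSV of source and target handles is then just a bit of awk (a sketch, assuming the columns are already bare `10568/xxxxx` handles):
```console
$ awk -F, '{printf "    /handle/%s /handle/%s;\n", $1, $2}' /tmp/handle-redirects.csv
```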
## 2024-04-19
- Spend some time looking at duplicate DOIs again...
- Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary
## 2024-04-20
- I read an [interesting thread about DOI casing](https://github.com/greenelab/scihub/issues/9)
- Apparently the DOI specification says ASCII characters in DOIs are case insensitive
- Indeed, [Crossref recommends lower case](https://www.crossref.org/documentation/member-setup/constructing-your-dois/) for all DOIs
- I was curious about the DOIs in our database so I checked before and after lower casing:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
COPY 25675
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
COPY 25666
```
- I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata...
## 2024-04-23
- Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
- The workflow curation tasks are not documented very well but I got a basic configuration working
- I found a bug in DSpace curation tasks and discussed on Slack
- I finalized the `NormalizeDOIs` curation task and released v7.6.1.1 of the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) project
## 2024-04-24
- A bit more testing of the curation tasks
- I tested a patch by Mark Wood
- I added support for normalizing DOIs to this same format to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) project
## 2024-04-25
- I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters
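- This is essentially the AGROVOC lowercasing pattern from 2024-02, restricted to the DOI field (220); a sketch:
```sql
BEGIN;
UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value ~ '[A-Z]';
COMMIT;
```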
- Spend some time looking at duplicate DOIs again...
## 2024-04-26
- Spend some time looking at duplicate DOIs again...
## 2024-04-29
- Start working on the IFPRI 2020–2021 batch migration
- I modified my `check_duplicates.py` script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact
- I noticed something in the Tomcat log:
```console
tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Womens Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [’] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
```
- I found the bitstream's ID and then used the `ds6_bitstream2itemhandle` [SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the item's handle
- Then I replaced the curly quote with a regular quote in all bitstreams
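- Since DSpace 6 the bitstream name is just `dc.title` metadata on the bitstream object, so the fix is roughly this (a sketch; it assumes `dc.title` is metadata field 64):
```sql
BEGIN;
UPDATE metadatavalue SET text_value = replace(text_value, '’', '''') WHERE dspace_object_id IN (SELECT uuid FROM bitstream) AND metadata_field_id=64 AND text_value LIKE '%’%';
COMMIT;
```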
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-05.md Normal file

@@ -0,0 +1,197 @@
---
title: "May, 2024"
date: 2024-05-01T10:39:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-05-01
- I dumped all the CGSpace DOIs and resolved them with my `crossref_doi_lookup.py` script
- Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
<!--more-->
## 2024-05-05
- Spend some time looking at duplicate DOIs again...
## 2024-05-06
- Spend some time looking at duplicate DOIs again...
## 2024-05-07
- Discuss RSS feeds and OpenSearch with IWMI
- It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch
- I saw a patch for an interesting issue on DSpace GitHub: [Error submitting or deleting items - URI too long when user is in a large number of groups](https://github.com/DSpace/DSpace/issues/9544)
- I hadn't realized it, but we have lots of those errors:
```console
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
1423
```
- Spend some time looking at duplicate DOIs again...
## 2024-05-08
- Spend some time looking at duplicate DOIs again...
- I finally finished looking at the duplicate DOIs for journal articles
- I updated the list of handle redirects and there are 386 of them!
## 2024-05-09
- Spend some time working on the IFPRI 2020–2021 batch
- I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
## 2024-05-12
- I couldn't figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields like titles, handles, and provenance separately:
```psql
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
```
- Then joined them:
```console
$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
```
- This gives me an insight into who submitted 334 of the duplicates over the past few years...
- I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ff, fi, fl, ffi, and ffl
## 2024-05-13
- Export a list of IFPRI information products with handle links and CONTENTdm links:
```
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
```
- I discovered the `/server/api/pid/find` endpoint today, which is much more direct and manageable than the `/server/api/discover/search/objects?query=` endpoint when trying to get metadata for a Handle (item, collection, or community)
- The "pid" stands for permanent identifiers apparently, and we can use it like this:
```
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
```
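- So getting the UUID for a Handle becomes a one-liner (assuming `jq`):
```console
$ curl -s 'https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424' | jq -r '.uuid'
```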
## 2024-05-15
- I got journal titles for 2,900 journal articles that were missing them from Crossref
## 2024-05-16
Helping IFPRI with some DSpace 7 API support, these are two queries for items issued in 2024:
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D — note the Lucene search syntax is URL encoded version of `:[2024-01-01T00:00:00Z TO *]`
Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
I wrote a new version of the `check_duplicates.py` script to help identify duplicates with different types
- Initially I called it `check_duplicates_fast.py` but it's actually not faster
- I need to find a way to deal with duplicates from IFPRI's repository because there are some mismatched types...
## 2024-05-20
Continue working through alternative duplicate matching for IFPRI
- Their item types are sometimes different than ours...
- One thing I think I can say for sure is that the default similarity factor in my script is 0.6, and I rarely see legitimate duplicates with such similarity so I might increase this to 0.7 to reduce the number of items I have to check
- Also, the maximum allowed difference in issue dates is currently 365 days, but I should reduce that a bit, perhaps to 270 days (9 months)
## 2024-05-22
- Finalize and upload the IFPRI 2020–2021 batch set
- I used a new technique to get missing licenses via Crossref (it's Python 2 because of OpenRefine's Jython):
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
## 2024-05-23
- Finalize the last of the duplicates I found for the IFPRI 2020–2021 batch set (those that we missed initially due to mismatched types)
- Export a new list of IFPRI redirects from CONTENTdm:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -r 'Original URLs? from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
4004
```
I found a way to get abstracts from PLOS
- They offer an API that returns XML including the JATS-formatted abstracts
- I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:
```console
"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'
```
Then used `value.parseXml()` on the resulting text to extract the abstract's text:
```console
value.parseXml().select("abstract")[0].xmlText()
```
This doesn't preserve `<p>` tags though...
- Oh, nice, this does!
```console
forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")
```
For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines...
- Ah, some articles have multiple abstracts, for example: https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript
- I need to select the abstract that does **not** have any attributes (using [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html))
```console
forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")
```
Testing `xsv` (Rust) versus `csvkit` (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:
```console
$ time xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv | xsv select doi | xsv count
27339
xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv 0.06s user 0.03s system 98% cpu 0.091 total
xsv select doi 0.02s user 0.02s system 40% cpu 0.091 total
xsv count 0.01s user 0.00s system 9% cpu 0.090 total
$ time csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
27339
csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv 1.15s user 0.06s system 95% cpu 1.273 total
csvcut -c doi 0.42s user 0.05s system 36% cpu 1.283 total
csvstat --count 0.20s user 0.03s system 18% cpu 1.298 total
```
## 2024-05-27
- Working on IFPRI datasets batch migration
- 732 items total
- 6 duplicates on CGSpace
- 6 duplicates within set that need investigation
## 2024-05-28
- I'm thinking of increasing the frequency of thumbnail generation on CGSpace
- Currently the `dspace filter-media` script runs once at 3AM for all media types and seems to take ~10 minutes to run for all 118,000 items...
- I think I will make the thumbnailer run explicitly more often using `-p "ImageMagick PDF Thumbnail"`
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-06.md Normal file

@@ -0,0 +1,119 @@
---
title: "June, 2024"
date: 2024-06-03T14:14:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-06-03
- Working on IFPRI datasets
- I noticed the licenses were missing from Nilam's original file so I found a way to check [Dataverse's API for a persistent identifier](https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats)
- We have both Handles and DOIs for these datasets, both from Harvard's Dataverse
<!--more-->
- I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):
```
"https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:" + value.split('https://doi.org/')[-1].toUppercase()
```
- Then I was able to extract the license text from the JSON response using:
```
value.parseJson()['datasetVersion']['termsOfUse']
```
- Similar for the Handle...
## 2024-06-04
- Some Dataverse entries have the license in `['datasetVersion']['license']` instead...
- I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace
## 2024-06-14
- Minor cleanups on IFPRI's 2016–2019 batch migration file
- I will start with duplicates on unique identifiers like DOIs
## 2024-06-18
- Merge and upload metadata for duplicates in IFPRI's 2016–2019 set:
- 144 exact match on CGSpace via DOI, type, and date
- 32 with CGSpace handles
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
## 2024-06-19
- Spent some time checking the remaining 3,312 items in the IFPRI 2016–2019 migration set for duplicates on CGSpace
- There seem to be about 50 exact matches of title, type, and issue date
## 2024-06-20
- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
- Heavy load on both CGSpace and DSpace 7 Test this afternoon
- Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
- The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
```
0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
```
- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
- I changed the nginx logging to use the `X-Forwarded-For` header, as the default `combined` log format uses `$remote_addr` by default, which is only accurate if the request doesn't come from Angular (ie directly to the API)
- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
- For now I will just add those to the list of bot networks
## 2024-06-21
- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
- I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
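- The real_ip part itself is small (a sketch; the trusted address is an assumption and depends on where the frontend proxying happens):
```console
set_real_ip_from 127.0.0.1;
real_ip_header X-Forwarded-For;
real_ip_recursive on;
```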
- Then I updated the list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS132203 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS136907 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS14618 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS21859 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS396982 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS8075
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
8675 /tmp/networks.txt
```
- Update list of ORCID identifiers with new ones from Alliance and IFPRI
- Finalize uploading the remaining 3,264 items from IFPRI's 2016–2019 batch migration to CGSpace
## 2024-06-24
- Minor updates to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) and [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to normalize a few more invalid DOI formats
## 2024-06-25
- Work on uploading some missing PDFs from the IFPRI 2016–2019 batch migration
## 2024-06-26
- Did a big cleanup of several thousand journal articles based on metadata from Crossref
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-07.md Normal file

@@ -0,0 +1,57 @@
---
title: "July, 2024"
date: 2024-07-01T09:37:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-07-01
- A bit of work to clean up duplicate DOIs on CGSpace
- A handful of book chapters, working papers, and journal articles using the wrong DOI
- I tried to delete all users who have been inactive since six years ago (July 1, 2018):
<!--more-->
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 07/01/2018 -d
```
- File an issue on DSpace GitHub: [Allow configuring disallowed domains for self registration](https://github.com/DSpace/DSpace/issues/9675)
## 2024-07-11
- Minor fixes to normalize the IFPRI CONTENTdm URLs in provenance fields:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'cdm/ref', 'digital') WHERE text_value LIKE '%CONTENTdm%cdm/ref/%';
UPDATE 1876
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'CONTENTdm: ', 'CONTENTdm: ') WHERE text_value LIKE '%CONTENTdm: %';
UPDATE 21
dspace=*# COMMIT;
COMMIT
```
- Then export a new list of CONTENTdm redirects, excluding withdrawn items:
```console
dspace= ☘ \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
COPY 8568
```
- Similarly, get a list of withdrawn item redirects:
```console
dspace= ☘ \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
COPY 396
```
## 2024-07-18
- I experimented with adding a regular expression to validate DOIs to the submission form
- It is a slightly modified version of the one found here: https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
- I decided it will probably be confusing to people and will have limited benefit, since we are normalizing most forms of DOIs to our preferred form after submission anyway
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-08.md Normal file

@@ -0,0 +1,71 @@
---
title: "August, 2024"
date: 2024-08-08T23:07:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-08-08
- While working on the CGIAR Climate Change Synthesis I learned some new tricks with OpenRefine
<!--more-->
- The first was to retrieve affiliations from OpenAlex and extract them from JSON with this GREL:
```
forEach(
value.parseJson()['authorships'],
a,
forEach(
a.parseJson()['institutions'],
i,
i['display_name']
).join("||")
).join("||")
```
- It is a nested `forEach` to extract all institutions for all authors
- Second was a better way to deduplicate lists in Jython while preserving list order:
```python
# better dedupe preserves order
seen = set()
deduped_list = [x for x in value.split("||") if x not in seen and not seen.add(x)]
return "||".join(deduped_list)
```
## 2024-08-20
- Delete duplicate metadata values using the method I described in this GitHub issue: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
## 2024-08-22
- Help IWMI with some OpenSearch RSS/Atom feeds for search results:
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:flooding
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:drought
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:landslides
- Export list of withdrawn handle redirects:
```
dspace=# \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
COPY 400
```
- Export list of IFPRI CONTENTdm redirects:
```
dspace-# \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
COPY 10794
```
- I filed [an issue](https://github.com/DSpace/dspace-angular/issues/3258) on DSpace Angular for anonymous users to be able to export search results to CSV
## 2024-08-26
- Spent some time trying to rebase our DSpace Angular themes on top of the massive header/navbar rework from [DSpace 7.6.2](https://github.com/DSpace/dspace-angular/pull/2858)
- Spent some time getting missing bibliographic metadata (issue dates, licenses, pages, volume, issue, publisher, etc) from Crossref for CGSpace
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-09.md Normal file

@@ -0,0 +1,147 @@
---
title: "September, 2024"
date: 2024-09-01T21:16:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-09-01
- Upgrade CGSpace to DSpace 7.6.2
<!--more-->
## 2024-09-05
- Finalize work on migrating DSpace Angular from Yarn to NPM
## 2024-09-06
- This morning Tomcat crashed due to an OOM kill:
```
Sep 06 00:00:24 server systemd[1]: tomcat9.service: A process of this unit has been killed by the OOM killer.
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Main process exited, code=killed, status=9/KILL
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Failed with result 'oom-kill'.
```
- According to the system journal, it was a Node.js dspace-angular process that tried to allocate memory and failed, thus invoking the OOM killer
- Currently I see high memory usage in those processes:
```console
$ pm2 status
┌────┬──────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │ cpu │ mem │ user │ watching │
├────┼──────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 0 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 994 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 1 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1015 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 2 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1029 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 3 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1042 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
└────┴──────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
```
- I bet if I look in the logs I'd find some kind of heavy traffic on the frontend, causing high caching for Angular SSR
## 2024-09-08
- Analyzing memory use in our DSpace hosts, which have 32GB of memory
- Effective cache of PostgreSQL is estimated at 11GB, which seems way high since the database is only 2GB
- Realistically this should be how we adjust, with PostgreSQL using ~8GB (or less) and each dspace-angular process pinned at 2GB...
> Total - Solr - Tomcat - Postgres - Nginx - Angular
> 31366 - (1024×4.4) - 7168 - (8×1024) - 512 - (4×2048) = 2796.4 left...
- I put some of these changes in on DSpace Test and will monitor this week
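- Concretely that direction looks something like this in `postgresql.conf` (a sketch with assumed values, not what is actually deployed):
```console
shared_buffers = 4GB
effective_cache_size = 8GB
```
- The dspace-angular side could be capped with pm2's `max_memory_restart` or Node's `--max-old-space-size`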
## 2024-09-10
- Some bot in South Africa made a ton of requests on the API and made the load hit the roof:
```
# grep -E '10/Sep/2024:[10-11]' /var/log/nginx/api-access.log | awk '{print $1}' | sort | uniq -c | sort -h
...
149720 102.182.38.90
```
- They are using several user agents so are obviously a bot:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0
Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
```
- I added them to the list of bot networks in nginx and the load went down
## 2024-09-11
- Upgrade DSpace 7 Test to Ubuntu 24.04
- I did some minor maintenance to test dspace-statistics-api with Python 3.12
- I tagged version 1.4.4 and released it on GitHub
## 2024-09-14
- Noticed a persistent higher than usual load on CGSpace and checked the server logs
- Found some new data center subnets to block because they were making thousands of requests with normal user agents
- I enabled HTTP/3 in nginx
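- For reference, the HTTP/3 change in nginx (1.25+) boils down to roughly this inside the server block (a sketch):
```console
listen 443 quic reuseport;
listen 443 ssl;
http2 on;
add_header Alt-Svc 'h3=":443"; ma=86400';
```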
- I enabled the SSR patch in Angular: https://github.com/DSpace/dspace-angular/issues/3110
## 2024-09-16
- Experiment with the [dspace-statistics-api-js](https://github.com/codeobia/dspace-statistics-api-js) on DSpace 7 Test
- In the past it always caused Solr to run out of memory, but I increased Solr's heap from 2g to 3g and it runs without crashing
- I attached VisualVM to Solr with a 3g and 4g heap and iterated over 1260 pages of results in the dspace-statistics-api-js:
![Solr with 3g heap](/cgspace-notes/2024/09/2024-09-16-Solr-3g-heap.png)
![Solr with 4g heap](/cgspace-notes/2024/09/2024-09-16-Solr-4g-heap.png)
## 2024-09-23
- Upgrade PostgreSQL from version 14 to 15 on DSpace Test the same way I did last year:
```console
# apt update
# apt install postgresql-15
# Update configs with Ansible
# systemctl stop tomcat9
# pg_ctlcluster 14 main stop
# tar -cvzpf var-lib-postgresql-14.tar.gz /var/lib/postgresql/14
# tar -cvzpf etc-postgresql-14.tar.gz /etc/postgresql/14
# pg_ctlcluster 15 main stop
# pg_dropcluster 15 main
# pg_upgradecluster 14 main
# pg_ctlcluster 15 main start
...
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_is_well_formed(text) does not exist
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_valid(text) does not exist
```
- After that I [re-indexed the database tables](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/) using a query to generate the `REINDEX` statements:
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- The database shrunk by 186MB!
## 2024-09-29
- I upgraded the database on CGSpace to PostgreSQL 15
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-10.md Normal file

@@ -0,0 +1,82 @@
---
title: "October, 2024"
date: 2024-10-03T11:01:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-10-03
- I had an idea to get abstracts from OpenAlex
- For [copyright reasons they don't include plain abstracts](https://docs.openalex.org/api-entities/works/work-object#abstract_inverted_index), but the [pyalex](https://github.com/J535D165/pyalex) library can convert them on the fly
<!--more-->
- I filtered for journal articles that were Creative Commons and missing abstracts:
```console
$ csvcut -c 'id,dc.title[en_US],dcterms.abstract[en_US],cg.identifier.doi[en_US],dcterms.type[en_US],dcterms.language[en_US],dcterms.license[en_US]' ~/Downloads/2024-09-30-cgspace.csv | csvgrep -c 'dcterms.type[en_US]' -r '^Journal Article$' | csvgrep -c 'cg.identifier.doi[en_US]' -r '^.+$' | csvgrep -c 'dcterms.license[en_US]' -r '^CC-' | csvgrep -c 'dcterms.abstract[en_US]' -r '^$' | csvgrep -c 'dcterms.language[en_US]' -r '^en$' | grep -v "||" | grep -v -- '-ND' | grep -v -E 'https://doi.org/10.(2499|4160|17528)/' > /tmp/missing-abstracts.csv
```
- Then I wrote a script to get them from OpenAlex (the gist of the API call is sketched below)
- After inspecting them and cleaning a few dozen up in OpenRefine (removing "Keywords:", copyright statements, HTML entities, etc.) I managed to get about 440
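- For reference, OpenAlex returns the abstract as an inverted index, which can be inspected with curl and jq before letting pyalex rebuild the plain text (the DOI here is a placeholder):
```console
$ curl -s 'https://api.openalex.org/works/https://doi.org/10.1234/placeholder' | jq '.abstract_inverted_index'  # placeholder DOI
```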
## 2024-10-06
- Since I increased Solr's heap from 2g to 3g a few weeks ago it seems like Solr is always using 100% CPU
- I don't understand this because it was running well before, and I only increased the heap in anticipation of running the dspace-statistics-api-js, though I never got around to that
- I just realized that this may be related to the JMX monitoring, as I've seen gaps in the Grafana dashboards and remember that it took surprisingly long to scrape the metrics
- Maybe I need to change the scrape interval
## 2024-10-08
- I checked the VictoriaMetrics vmagent dashboard and saw that there were thousands of errors scraping the `jvm_solr` target from Solr
- So it seems like I do need to change the scrape interval
- I will increase it from 15s (global) to 20s for that job
- Reading some documentation I found [this reference from Brian Brazil that discusses this very problem](https://www.robustperception.io/keep-it-simple-scrape_interval-id/)
- He recommends keeping a single scrape interval for all targets, but also checking the slow exporter (`jmx_exporter` in this case) and seeing if we can limit the data we scrape
- To keep things simple for now I will increase the global scrape interval to 20s
- Long term I should limit the metrics...
- Oh wow, I found out that [Solr ships with a Prometheus exporter!](https://solr.apache.org/guide/8_11/monitoring-solr-with-prometheus-and-grafana.html) and even includes a Grafana dashboard
- I'm trying to run the Solr prometheus-exporter as a one-off systemd unit to test it:
```console
# cd /opt/solr-8.11.3/contrib/prometheus-exporter
# systemd-run --uid=victoriametrics --gid=victoriametrics --working-directory=/opt/solr-8.11.3/contrib/prometheus-exporter ./bin/solr-exporter -p 9854 -b http://localhost:8983/solr -f ./conf/solr-exporter-config.xml -s 20
```
- The exporter's default scrape interval is 60 seconds, so if we scrape it more often than that the metrics will be stale
- From what I've seen this returns in less than one second so it should be safe to reduce the scrape interval
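- Timing a scrape of the exporter's metrics endpoint is an easy way to check that, for example:
```console
$ time curl -s -o /dev/null http://localhost:9854/metrics  # the /metrics path is an assumption
```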
## 2024-10-19
- Heavy load on CGSpace today
- There is a noticeable increase just before 4PM local time
- I extracted a list of IPs:
```console
# grep -E '19/Oct/2024:1[567]' /var/log/nginx/api-access.log | awk '{print $1}' | sort -u > /tmp/ips.txt
```
- I looked them up (an ASN lookup sketch follows the list) and found data center networks that were using normal user agents across hundreds of IPs, for example:
- 154.47.29.168 # 212238 (CDNEXT - Datacamp Limited, GB)
- 91.210.64.12 # 29802 (HVC-AS, US) - HIVELOCITY, Inc.
- 103.221.57.120 # 132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
- 109.107.150.136 # 201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
- 185.210.207.1 # 209709 (CODE200-ISP1 - UAB code200, LT)
- 185.162.119.101 # 207223 (GLOBALCON - Global Connections Network LLC, US)
- 173.244.35.101 # 64286 (LOGICWEB, US) - Tesonet
- 139.28.160.141 # 396319 (US-INTERNET-396319, US) - OxyLabs
- 104.143.89.112 # 62874 (WEB2OBJECTS, US) - Web2Objects LLC
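- For reference, Team Cymru's whois service is a quick way to map one of these IPs to its ASN and network name:
```console
$ whois -h whois.cymru.com " -v 154.47.29.168"  # -v adds the AS name, country, and registry columns
```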
- I added some network blocks to the nginx conf
- Interestingly, I see many different IPs using the same user agent today:
```console
# grep "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.3" /var/log/nginx/api-access.log | awk '{print $1}' | sort -u | wc -l
767
```
- For reference, the current Chrome version is 129 or so...
- This is definitely worth looking into because it seems like one massive botnet
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-11.md Normal file

@@ -0,0 +1,50 @@
---
title: "November, 2024"
date: 2024-11-11T09:47:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-11-11
- Some IP in India is making tons of requests this morning with a normal user agent:
```console
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
```
<!--more-->
- They are using this user agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
```
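- Bucketing that IP's requests by hour shows when the traffic started, for example:
```console
# grep '^49.207.196.249 ' /var/log/nginx/api-access.log | awk '{print $4}' | cut -c 2-15 | sort | uniq -c  # assumes the default combined log format
```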
## 2024-11-16
- I switched CGSpace to Node.js v20 since I've been using it in dev and test for months
## 2024-11-18
- I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc
- Google also publishes their IP ranges and recommends verifying crawlers via reverse DNS (sketched below): https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- Our nginx config doesn't rate limit the API but perhaps that needs to change...
- In DSpace 4/5/6 the API was separate from the user interface so we didn't need to enforce rate limits there because we encouraged using that over scraping the UI
- In DSpace 7 the API is used by the frontend and perhaps should have the same IP- and UA-based rate limiting
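- Genuine Googlebot traffic can be verified with a reverse DNS lookup, which should resolve to a googlebot.com or google.com hostname; this Hetzner IP should not:
```console
$ host 188.34.177.10  # a real Googlebot IP reverse-resolves to *.googlebot.com or *.google.com
```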
## 2024-11-19
- I notice 10,000 requests by a new bot yesterday:
```
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
```
- Laminas is the successor to Zend Framework, so this seems to be some PHP application's HTTP client library
- Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
- 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents
<!-- vim: set sw=2 ts=2: -->

content/posts/2024-12.md Normal file

@@ -0,0 +1,28 @@
---
title: "December, 2024"
date: 2024-12-04T10:19:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-12-04
- We need to get view and download statistics for the last year from CGSpace
- The only way to get that is using Solr
<!--more-->
- After consulting the [Solr documentation](https://solr.apache.org/guide/8_11/working-with-dates.html) I came up with this facet query:
> facet.range=time&facet.range.start=NOW/MONTH-11MONTHS&facet.range.end=NOW/MONTH+1MONTH&facet.range.gap=+1MONTH
- [This StackOverflow answer](https://stackoverflow.com/questions/34290600/how-to-apply-facet-on-date-field-where-result-should-provide-number-of-records-f) helped too, recommending `NOW/MONTH` to get neatly bucketed months because this will use the beginning of the current month
- For views, I added the following query parameters: `q=type:2&fq=-isBot:true AND statistics_type:view`
> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview&indent=true&q.op=OR&q=type%3A2&rows=0
- For downloads I added the following query parameters: `q=type:0&fq=-isBot:true AND statistics_type:view AND bundleName:ORIGINAL`
> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview%20AND%20bundleName%3AORIGINAL&indent=true&q.op=OR&q=type%3A0&rows=0
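- Either query can also be written with curl and `--data-urlencode` instead of hand-encoding the URL; for example, the views query:
```console
$ curl -sG 'http://localhost:8983/solr/statistics/select' \
    --data-urlencode 'q=type:2' \
    --data-urlencode 'fq=-isBot:true AND statistics_type:view' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.range=time' \
    --data-urlencode 'facet.range.start=NOW/MONTH-11MONTHS' \
    --data-urlencode 'facet.range.end=NOW/MONTH+1MONTH' \
    --data-urlencode 'facet.range.gap=+1MONTH' \
    --data-urlencode 'rows=0'
```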
<!-- vim: set sw=2 ts=2: -->

content/posts/2025-01.md Normal file

@@ -0,0 +1,38 @@
---
title: "January, 2025"
date: 2025-01-03T11:09:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2025-01-03
- Trying to get search results for a large boolean query given to me by some researchers
- When searching via the Angular frontend I see an error in the Tomcat logs:
<!--more-->
```
Jan 03 09:08:40 dspace tomcat9[876]: Jan 03, 2025 9:08:40 AM org.apache.coyote.http11.Http11Processor service
Jan 03 09:08:40 dspace tomcat9[876]: INFO: Error parsing HTTP request header
Jan 03 09:08:40 dspace tomcat9[876]: Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
Jan 03 09:08:40 dspace tomcat9[876]: java.lang.IllegalArgumentException: Request header is too large
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:778)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeader(Http11InputBuffer.java:892)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeaders(Http11InputBuffer.java:593)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:279)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:937)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1190)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at java.base/java.lang.Thread.run(Thread.java:840)
```
- The size of the query itself is 5362 bytes
- Increasing the `maxHttpHeaderSize` from the default of 8192 bytes to 16384 allows the search to complete successfully (a quick way to reproduce the limit is sketched below)
- I notice that we had previously increased the `maxHttpHeaderSize` on the HTTP connector in Tomcat 7, which we are no longer using in Tomcat 9, so this is an overdue change
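- A quick way to reproduce the limit is to send a request with an oversized header directly to the backend and confirm that Tomcat rejects it (localhost:8080 is an assumption about where Tomcat listens):
```console
$ curl -s -o /dev/null -w '%{http_code}\n' -H "X-Test: $(head -c 9000 /dev/zero | tr '\0' a)" http://localhost:8080/server/api  # backend host/port are assumptions
```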
<!-- vim: set sw=2 ts=2: -->

View File

@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -242,15 +242,15 @@ db.statementpool = true
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -264,15 +264,15 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -200,15 +200,15 @@ $ find SimpleArchiveForBio/ -iname &ldquo;*.pdf&rdquo; -exec basename {} ; | sor
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -378,15 +378,15 @@ Bitstream: tést señora alimentación.pdf
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -316,15 +316,15 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -32,7 +32,7 @@ After running DSpace for over five years I&rsquo;ve never needed to look in any
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -495,15 +495,15 @@ dspace.log.2016-04-27:7271
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -371,15 +371,15 @@ sys 0m20.540s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -409,15 +409,15 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -325,15 +325,15 @@ discovery.index.authority.ignore-variants=true
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -389,15 +389,15 @@ $ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -34,7 +34,7 @@ It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -606,15 +606,15 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -42,7 +42,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -372,15 +372,15 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;h
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module (#286)
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -548,15 +548,15 @@ org.dspace.discovery.SearchServiceException: Error executing query
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it&rsquo;s not r
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -784,15 +784,15 @@ $ exit
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -369,15 +369,15 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -423,15 +423,15 @@ COPY 1968
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -355,15 +355,15 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -585,15 +585,15 @@ $ gem install compass -v 1.0.3
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -391,15 +391,15 @@ UPDATE 187
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -270,15 +270,15 @@ $ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -275,15 +275,15 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -517,15 +517,15 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -659,15 +659,15 @@ Cert Status: good
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -443,15 +443,15 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -944,15 +944,15 @@ $ cat dspace.log.2017-11-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sor
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -30,7 +30,7 @@ The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -783,15 +783,15 @@ DELETE 20
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1452,15 +1452,15 @@ Catalina:type=Manager,context=/,host=localhost activeSessions 8
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -30,7 +30,7 @@ We don&rsquo;t need to distinguish between internal and external works, so that
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1038,15 +1038,15 @@ UPDATE 3
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -585,15 +585,15 @@ Fixed 5 occurences of: GENEBANKS
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -594,15 +594,15 @@ $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -523,15 +523,15 @@ $ psql -h localhost -U postgres dspacetest
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -58,7 +58,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -517,15 +517,15 @@ $ sed &#39;/^id/d&#39; 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -569,15 +569,15 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -442,15 +442,15 @@ $ dspace database migrate ignored
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -30,7 +30,7 @@ I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -748,15 +748,15 @@ UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND me
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -656,15 +656,15 @@ $ curl -X GET -H &#34;Content-Type: application/json&#34; -H &#34;Accept: applic
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -553,15 +553,15 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -594,15 +594,15 @@ UPDATE 1
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -50,7 +50,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
357 207.46.13.1
903 54.70.40.11
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -952,6 +952,7 @@ $ http &#39;http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&am
<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ILRI?src=hash&amp;ref_src=twsrc%5Etfw">#ILRI</a> research: Towards unlocking the potential of the hides and skins value chain in Somaliland <a href="https://t.co/EZH7ALW4dp">https://t.co/EZH7ALW4dp</a></p>&mdash; ILRI.org (@ILRI) <a href="https://twitter.com/ILRI/status/1086330519904673793?ref_src=twsrc%5Etfw">January 18, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<ul>
<li>The shortened link is <a href="goo.gl/fb/VRj9Gq">goo.gl/fb/VRj9Gq</a> and it shows a &ldquo;Dynamic Link not found&rdquo; error from Firebase:</li>
</ul>
@@ -1264,15 +1265,15 @@ identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInter
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -72,7 +72,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1344,15 +1344,15 @@ Please see the DSpace documentation for assistance.
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1208,15 +1208,15 @@ sys 0m2.551s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1299,15 +1299,15 @@ UPDATE 14
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -48,7 +48,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -631,15 +631,15 @@ COPY 64871
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -317,15 +317,15 @@ UPDATE 2
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -38,7 +38,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -554,15 +554,15 @@ issn.validate(&#39;1020-3362&#39;)
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s luck
Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -573,15 +573,15 @@ sys 2m27.496s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -581,15 +581,15 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -385,15 +385,15 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -58,7 +58,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -692,15 +692,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -404,15 +404,15 @@ UPDATE 1
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -604,15 +604,15 @@ COPY 2900
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1275,15 +1275,15 @@ Moving: 21993 into core statistics-2019
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -484,15 +484,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week
On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -658,15 +658,15 @@ $ psql -c &#39;select * from pg_stat_activity&#39; | wc -l
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -477,15 +477,15 @@ Caused by: java.lang.NullPointerException
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
In other news, I checked the statistics API on DSpace 6 and it&rsquo;s working
I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -811,15 +811,15 @@ $ csvcut -c &#39;id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]&#
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1142,15 +1142,15 @@ Fixed 4 occurrences of: Muloi, D.M.
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -798,15 +798,15 @@ $ grep -c added /tmp/2020-08-27-countrycodetagger.log
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -717,15 +717,15 @@ solr_query_params = {
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1241,15 +1241,15 @@ $ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -32,7 +32,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -731,15 +731,15 @@ $ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspa
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -869,15 +869,15 @@ $ query-json &#39;.items | length&#39; /tmp/policy2.json
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -688,15 +688,15 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -60,7 +60,7 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
}
}
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -898,15 +898,15 @@ dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -875,15 +875,15 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -1042,15 +1042,15 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -685,15 +685,15 @@ May 26, 02:57 UTC
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ I simply started it and AReS was running again:
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -693,15 +693,15 @@ I simply started it and AReS was running again:
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -715,15 +715,15 @@ COPY 20994
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -606,15 +606,15 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -588,15 +588,15 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -46,7 +46,7 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
So we have 1879/7100 (26.46%) matching already
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -791,15 +791,15 @@ Try doing it in two imports. In first import, remove all authors. In second impo
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -32,7 +32,7 @@ First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -494,15 +494,15 @@ $ zstd statistics-2019.json
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -40,7 +40,7 @@ Purging 455 hits from WhatsApp in statistics
Total number of bot hits purged: 3679
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -577,15 +577,15 @@ Total number of bot hits purged: 3679
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -24,7 +24,7 @@ Start a full harvest on AReS
Start a full harvest on AReS
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -380,15 +380,15 @@ Start a full harvest on AReS
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -38,7 +38,7 @@ We agreed to try to do more alignment of affiliations/funders with ROR
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -724,15 +724,15 @@ isNotNull(value.match(&#39;699&#39;))
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -34,7 +34,7 @@ $ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &#39;fuuu&
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -476,15 +476,15 @@ isNotNull(value.match(&#39;889&#39;))
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2022"/>
<meta name="twitter:description" content="2022-04-01 I did G1GC tests on DSpace Test (linode26) to compliment the CMS tests I did yesterday The Discovery indexing took this long: real 334m33.625s user 227m51.331s sys 3m43.037s 2022-04-04 Start a full harvest on AReS Help Marianne with submit/approve access on a new collection on CGSpace Go back in Gaia&rsquo;s batch reports to find records that she indicated for replacing on CGSpace (ie, those with better new copies, new versions, etc) Looking at the Solr statistics for 2022-03 on CGSpace I see 54."/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -509,15 +509,15 @@
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -66,7 +66,7 @@ If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com. I see a ha
I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
"/>
<meta name="generator" content="Hugo 0.118.2">
<meta name="generator" content="Hugo 0.133.1">
@@ -445,15 +445,15 @@ I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

Some files were not shown because too many files have changed in this diff.