mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Compare commits
153 Commits
572f4639ac
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 63a2dcfdee | | |
| | e7d7d4af89 | | |
| | bd2d9779bb | | |
| | 47b96e8370 | | |
| | 512848fc73 | | |
| | f8a1876ad2 | | |
| | bb1367025a | | |
| | dabbc20806 | | |
| | edd2a8b306 | | |
| | 842373d26f | | |
| | 35342f95dc | | |
| | 79708bd30c | | |
| | a5298945a3 | | |
| | 062019463c | | |
| | f1c25111d0 | | |
| | da6d73bc1f | | |
| | 7be53639dc | | |
| | 64b8957945 | | |
| | 89d1b61442 | | |
| | 668947909a | | |
| | 7858008918 | | |
| | c3436ea6c2 | | |
| | bf4a6402d7 | | |
| | 8383cd466b | | |
| | 6d574d645d | | |
| | befe3a3a58 | | |
| | 39d8d0876c | | |
| | 28a0c82e96 | | |
| | 7fc97884df | | |
| | 223453adbb | | |
| | 1b523bf055 | | |
| | 908a75a5c7 | | |
| | e323c15e8b | | |
| | 8f156a0365 | | |
| | 515cc0650f | | |
| | 6db3da2739 | | |
| | 60b244486f | | |
| | efd8eb7f79 | | |
| | 281827944a | | |
| | 864b3b136e | | |
| | 01a2ff5bfd | | |
| | d71c430a7d | | |
| | 0e43fc97d7 | | |
| | 90c4d46607 | | |
| | 83c053f7ee | | |
| | ba68787282 | | |
| | 1fc45e8f1b | | |
| | 11f1935f85 | | |
| | 5ff70af33b | | |
| | b60a58f56a | | |
| | cc28c0ccdc | | |
| | 1e87242956 | | |
| | 483a170f06 | | |
| | 0692b8666c | | |
| | b2eaff29b1 | | |
| | da0fd61b7e | | |
| | 3f4b66bd08 | | |
| | ed290fb6f8 | | |
| | 63c20dbef9 | | |
| | 300b2e4271 | | |
| | 57fe0587a4 | | |
| | 20ace46614 | | |
| | 3475d4fd5d | | |
| | 1dfb54ef6b | | |
| | 82c79fc257 | | |
| | cf5c1e2155 | | |
| | 7418dae4b9 | | |
| | 264cdcf1db | | |
| | 293b500b26 | | |
| | 17a241de5b | | |
| | 7695eacf7a | | |
| | f4c985c16b | | |
| | bc6412de09 | | |
| | 2ecafafc17 | | |
| | 804a505ae2 | | |
| | 6c5fa7375f | | |
| | f2bee38014 | | |
| | a50fe66c78 | | |
| | 177c3b796d | | |
| | eb218389a0 | | |
| | 1dd5900fbf | | |
| | d14dd7114a | | |
| | 01fb17950b | | |
| | c6d514bef9 | | |
| | 34523acc47 | | |
| | 3a4ecbd82d | | |
| | c9bcfca903 | | |
| | 7e3a7951d6 | | |
| | 8d39fc7d71 | | |
| | 22dd379e9a | | |
| | 98cdd21cb5 | | |
| | 62838a091c | | |
| | cb40610726 | | |
| | 249d9be387 | | |
| | 4a02a78186 | | |
| | aa6cbb488d | | |
| | aeaa397612 | | |
| | d60b85433d | | |
| | 202d3fb88f | | |
| | afcbc67874 | | |
| | 22e47beeb6 | | |
| | 223979f267 | | |
| | 28d62f1c0c | | |
| | 34bf124d5d | | |
| | 011a1ec9db | | |
| | 45781d590d | | |
| | d8e0004240 | | |
| | bfb7da50af | | |
| | 6ec5e4b006 | | |
| | 1529cfd80b | | |
| | 6737febf95 | | |
| | 6fbcc342d2 | | |
| | e83e681706 | | |
| | 33061dbe3a | | |
| | d2ad21bde1 | | |
| | f38ecfb75e | | |
| | 24dd6fefb5 | | |
| | a659eef05f | | |
| | 9944f61ed5 | | |
| | 87ccbfc0f0 | | |
| | 929ce9685a | | |
| | e0f9e484ee | | |
| | 021a92c0d9 | | |
| | c97d005aa4 | | |
| | 190a1ee4a3 | | |
| | 9a2de13f21 | | |
| | c644f40491 | | |
| | 6e701ee9c2 | | |
| | e4dc8a3ed0 | | |
| | 74f4afe72a | | |
| | 8bebf47078 | | |
| | 8c1e898683 | | |
| | 89d3fb717c | | |
| | 309ffad285 | | |
| | 0fab2a0f28 | | |
| | ae41ef3682 | | |
| | 4415eec1a0 | | |
| | 6985b53a7b | | |
| | df88592009 | | |
| | 3a68bc3cc7 | | |
| | 943fa8f1a2 | | |
| | 363dbb4505 | | |
| | bda3cb4cd1 | | |
| | 33c42ecd49 | | |
| | a9dc98b2dd | | |
| | 0b0d2ea87d | | |
| | 825385562d | | |
| | 416d2bc7a7 | | |
| | 7cde2ad26b | | |
| | 5fbc484c80 | | |
| | aa5fab70b7 | | |
| | d8be9c001c | | |
| | a4a725f22e | | |
@@ -209,7 +209,7 @@ dc.identifier.issn
|
||||
- I need to follow up with Moayad about the reporting functionality
|
||||
- Also, I need to email Harrison my notes on the CG Core v2 stuff
|
||||
- Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going
|
||||
- Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: "Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission."
|
||||
- Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: "Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission."
|
||||
- I looked in the DSpace logs and found this right around the time of the screenshot he sent me:
|
||||
|
||||
```
|
||||
|
||||
@@ -169,7 +169,7 @@ $ csvjoin --outer -c alpha2 ~/Downloads/clarisa-countries.csv ~/Downloads/UNSD\
|
||||
- Then re-export the UN M.49 countries to a clean list because the one I did yesterday somehow has errors:
|
||||
|
||||
```console
|
||||
csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ —\ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
|
||||
$ csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ —\ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
|
||||
```
|
||||
|
||||
- Check the number of lines in each file:
|
||||
|
||||
@@ -63,7 +63,7 @@ $ csvjoin -c doi /tmp/2023-02-01-cgspace-doi-metadata.csv ~/Downloads/2023-02-01
|
||||
|
||||
|
||||
```console
|
||||
curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
|
||||
$ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
|
||||
```
|
||||
|
||||
- I need to ask on the DSpace Slack about this POST pagination
|
||||
|
||||
@@ -30,4 +30,168 @@ $ psql < locks-age.sql | grep " days " | awk -F"|" '{print $10}' | sort -u | xar
|
||||
- libvips does it correctly automatically and looks closer to the PDF
|
||||
- Meeting about CG Core types
|
||||
|
||||
## 2023-05-10
|
||||
|
||||
- Write a script to find the `metadata_field_id` values associated with the non-AGROVOC subjects I am working on for Sara
|
||||
- This is useful because we want to know who to contact for a definition
|
||||
- The script was:
|
||||
|
||||
```bash
|
||||
while read -r subject; do
|
||||
metadata_field_id=$(psql -h localhost -U postgres -d dspacetest -qtAX <<SQL
|
||||
SELECT DISTINCT(metadata_field_id) FROM metadatavalue WHERE LOWER(text_value)='$subject'
|
||||
SQL
|
||||
)
|
||||
metadata_field_id=$(echo $metadata_field_id | sed 's/[[:space:]]/||/g')
|
||||
|
||||
echo "$subject,$metadata_field_id"
|
||||
done < <(csvcut -c 1 ~/Downloads/2023-04-26\ CGIAR\ non-AGROVOC\ subjects.csv | sed 1d)
|
||||
```
|
||||
|
||||
- I also realized that Bernard Bett didn't have any items on CGSpace tagged with his ORCID identifier, so I tagged 230!
|
||||
|
||||
## 2023-05-11
|
||||
|
||||
- CG Core meeting
|
||||
- Finalize looking at the CGSpace non-AGROVOC subjects for FAO
|
||||
|
||||
## 2023-05-12
|
||||
|
||||
- Export the Alliance community to do some country/region fixes
|
||||
- I also sent Maria and Francesca the export because they want to add more regions and subregions
|
||||
- Export the entire CGSpace to check for missing Initiative collection mappings
|
||||
- I also added missing regions
|
||||
|
||||
## 2023-05-16
|
||||
|
||||
- I finally cleaned up and published my latest evaluation of [JPEG, WebP, and AVIF](https://alanorth.github.io/improved-dspace-thumbnails/evaluating-jpeg-webp-avif.html)
|
||||
- I [filed an issue on DSpace](https://github.com/DSpace/DSpace/issues/8849) to track this
|
||||
|
||||
## 2023-05-17
|
||||
|
||||
- Re-sync CGSpace to DSpace 7 Test
|
||||
- I came up with a naive patch to use WebP instead of JPEG in the DSpace ImageMagick filter, and it works, but doesn't replace existing JPEGs... hmmm
|
||||
- Also, it does PDF to WebP to WebP haha
|
||||
|
||||
## 2023-05-18
|
||||
|
||||
- I created a [pull request](https://github.com/DSpace/DSpace/pull/8850) to improve some minor documentation, typo, and logic issues in the DSpace ImageMagick thumbnail filters
|
||||
- I realized that there is a quick win to the generation loss issue with ImageMagickThumbnailFilter
|
||||
- We can use ImageMagick's internal MIFF instead of JPEG when writing the intermediate image
|
||||
- According to the [libvips author PNG is very slow](https://github.com/libvips/libvips/issues/571)!
|
||||
- I re-ran my `generation-loss.sh` script using MIFF and found that it had essentially the same results as PNG, which is about 1.1 points higher on the ssimulacra2 (v2.1) scoring scale
|
||||
- Also, according to my tests with the cosmo rusage.com utility, I see that MIFF is indeed much faster than PNG
|
||||
- I updated my pull request to add this quick win
|
||||
- Weekly CG Core types meeting
|
||||
- Low attendance so I just kept working on the spreadsheet
|
||||
- We are at the stage of voting on definitions
|
||||
|
||||
## 2023-05-19
|
||||
|
||||
- I ported a few of the minor ImageMagick Thumbnail Filter improvements to our `6_x-prod` branch
|
||||
|
||||
## 2023-05-20
|
||||
|
||||
- I deployed the latest thumbnail changes on CGSpace, ran all updates, and rebooted it
|
||||
- I exported CGSpace to check for missing Initiative mappings
|
||||
- Then I started a harvest on AReS
|
||||
|
||||
## 2023-05-23
|
||||
|
||||
- Help Francesca with an import of a journal article with a few hundred authors
|
||||
- I used the DSpace 7 live import from PubMed
|
||||
- I also noticed a bug in the CrossRef live import if you change the DOI field, so I [filed an issue](https://github.com/DSpace/DSpace/issues/8865)
|
||||
|
||||
## 2023-05-25
|
||||
|
||||
- Meeting on output types
|
||||
- Make a [pull request on DSpace to capture publisher during live import from Crossref](https://github.com/DSpace/DSpace/pull/8866)
|
||||
|
||||
## 2023-05-26
|
||||
|
||||
- Make a [pull request on DSpace to update checkstyle](https://github.com/DSpace/DSpace/pull/8868)
|
||||
- Make a [pull request on DSpace-angular to fix an incorrect i18n UI string](https://github.com/DSpace/dspace-angular/pull/2274)
|
||||
- I'm experimenting with replacing old thumbnails
|
||||
- In the past we used to upload thumbnails for journal covers, but those were low quality and look horrible now
|
||||
- Using the provenance field I want to identify items with 1 bitstream of type gif or jpg, then extract the item IDs along with DOIs:
|
||||
|
||||
```sql
|
||||
\COPY (SELECT
|
||||
text_value,
|
||||
dspace_object_id
|
||||
FROM
|
||||
metadatavalue
|
||||
WHERE
|
||||
dspace_object_id IN (
|
||||
SELECT
|
||||
dspace_object_id
|
||||
FROM
|
||||
metadatavalue
|
||||
WHERE
|
||||
metadata_field_id = 28
|
||||
AND place = 0
|
||||
AND (text_value LIKE '%No. of bitstreams: 1%'
|
||||
AND text_value SIMILAR TO '%.(gif|jpg|jpeg)%'))
|
||||
AND metadata_field_id = 220) TO /tmp/items-with-old-bitstreams.csv WITH CSV HEADER;
|
||||
```
|
||||
|
||||
- I extract the DOIs and look them up on CrossRef to see which are CC-BY, then extract those:
|
||||
|
||||
```console
|
||||
$ csvcut -c text_value /tmp/items-with-old-bitstreams.csv | sed 1d > /tmp/dois.txt
|
||||
$ ./ilri/crossref_doi_lookup.py -i /tmp/dois.txt -e fuuu@example.com -o /tmp/dois-resolved.csv
|
||||
$ csvgrep -c license -m 'creativecommons' /tmp/dois-resolved.csv \
|
||||
| csvgrep -c license -m 'by-nc-nd' --invert-match \
|
||||
| csvcut -c doi \
|
||||
| sed '2,$s_^\(.*\)$_https://doi.org/\1_' \
|
||||
| sed 1d > /tmp/dois-for-cc-items-with-old-bitstreams.txt
|
||||
```
|
||||
|
||||
- This results in 262 items that have DOIs that are CC-BY (but not ND)
|
||||
- This is a good starting point, but misses some that had low-quality thumbnails uploaded after they were added (ie, there's no record of a bitstream in the provenance field)
|
||||
- I ran the list through my Sci-Hub download script and filtered out a few that downloaded invalid PDFs (manually), then generated thumbnails for all of them:
|
||||
|
||||
```console
|
||||
$ ~/src/git/DSpace/ilri/get_scihub_pdfs.py -i /tmp/dois-for-cc-items-with-old-bitstreams.txt -o bitstreams.csv
|
||||
$ chrt -b 0 vipsthumbnail *.pdf --export-profile srgb -s 600x600 -o './%s.pdf.jpg[Q=02,optimize_coding,strip]'
|
||||
```
|
||||
|
||||
- Then I joined the CSVs on the DOI column, filtered out any that we didn't find PDFs for, and formatted the resulting CSV with an id, filename, and bundle column:
|
||||
|
||||
```console
|
||||
$ csvjoin -c doi bitstreams.csv /tmp/items-with-old-bitstreams.csv \
|
||||
| csvgrep -c filename --invert-match -r '^$' \
|
||||
| sed '1s/dspace_object_id/id/' \
|
||||
| csvcut -c id,filename \
|
||||
| sed -e '1s/^\(.*\)$/\1,bundle/' -e '2,$s/^\(.*\)$/\1.jpg__description:libvips thumbnail,THUMBNAIL/' > new-thumbnails.csv
|
||||
```
|
||||
|
||||
- I did a dry run with `ilri/post_bitstreams.py` and it seems that most (all?) already have thumbnails from the last time I did a massive Sci-Hub check
|
||||
- So the provenance field does not seem very reliable for this, and that was a waste of two hours...
|
||||
- I did discover, while originally posting WebP thumbnails, that the format is not set correctly when uploading WebP via the REST API (it is set to Unknown), but it is set correctly when uploading via XMLUI
|
||||
- POSTing a JPG to the THUMBNAIL bundle sets the format to JPEG...
|
||||
- I am guessing that is a bug that I won't bother troubleshooting since the DSpace 6.x REST API is deprecated
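The license filter in the `csvgrep` pipeline above can also be sketched in Python with only the standard library. This is an illustration, not the tooling used in these notes: the function name `cc_doi_urls` and the sample rows are hypothetical, and the `doi` and `license` column names are taken from the pipeline.

```python
import csv
import io


def cc_doi_urls(csv_text):
    """Return https://doi.org/ URLs for rows whose license is Creative
    Commons but not BY-NC-ND, mirroring the two csvgrep filters."""
    urls = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Lowercased here for convenience; csvgrep -m is case-sensitive
        license_url = (row.get("license") or "").lower()
        if "creativecommons" not in license_url:
            continue
        if "by-nc-nd" in license_url:
            continue
        urls.append("https://doi.org/" + row["doi"])
    return urls


# Hypothetical sample data for illustration only
sample = """doi,license
10.1000/a,https://creativecommons.org/licenses/by/4.0/
10.1000/b,https://creativecommons.org/licenses/by-nc-nd/4.0/
10.1000/c,https://www.elsevier.com/tdm/userlicense/1.0/
"""
print(cc_doi_urls(sample))
```

Like the pipeline, this excludes only `by-nc-nd`; a plain `by-nd` license would still pass the filter.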
|
||||
|
||||
## 2023-05-27
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Then I also ran the csv-metadata-quality tool on the Initiatives to do some easy fixes like country/region mapping and whitespace fixes
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-05-29
|
||||
|
||||
- Re-create my local PostgreSQL 14 container:
|
||||
|
||||
```console
|
||||
$ podman rm dspacedb14
|
||||
$ podman pull docker.io/postgres:14-alpine
|
||||
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d docker.io/postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
|
||||
```
|
||||
|
||||
- Export CGSpace again to do some major cleanups in OpenRefine
|
||||
- I found a few countries that are in the ISO 3166-1 and UN M.49 lists, but not in ours so I added them to the list in `input-forms.xml` and regenerated the controlled vocabularies for the CGSpace Submission Guidelines
|
||||
- There were a handful of issues with ISSNs, ISBNs, DOIs, access status, licenses, and missing CGIAR Trust Fund donors for Initiatives outputs
|
||||
- This was about 455 items
|
||||
- Helping the Alliance web team understand the DSpace REST API for determining which collection an item belongs to
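A minimal sketch of how the owning collection could be read from the DSpace 6 REST API, assuming the `expand=parentCollection` option on the `/rest/items/{uuid}` endpoint and a JSON response with a `parentCollection` object. The base URL is the DSpace Test one used elsewhere in these notes; the function names and the sample values in the usage example are hypothetical.

```python
import json
import urllib.request

REST_BASE = "https://dspacetest.cgiar.org/rest"


def item_url(item_uuid):
    """Build the item URL, asking the API to expand the owning collection."""
    return f"{REST_BASE}/items/{item_uuid}?expand=parentCollection"


def owning_collection(item):
    """Return (name, handle) of the owning collection from an item response."""
    parent = item.get("parentCollection") or {}
    return parent.get("name"), parent.get("handle")


def fetch_item(item_uuid):
    """Fetch one item as JSON (performs a network request)."""
    request = urllib.request.Request(
        item_url(item_uuid), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Usage would look like `owning_collection(fetch_item(uuid))`. An item mapped to several collections would need `expand=parentCollectionList` instead, under the same assumption about the API.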
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
|
||||
252
content/posts/2023-06.md
Normal file
252
content/posts/2023-06.md
Normal file
@@ -0,0 +1,252 @@
|
||||
---
|
||||
title: "June, 2023"
|
||||
date: 2023-06-02T10:29:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-06-02
|
||||
|
||||
- Spend some time testing my `post_bitstreams.py` script to update thumbnails for items on CGSpace
|
||||
- Interestingly I found an item with a JFIF thumbnail and another with a WebP thumbnail...
|
||||
- Meeting with Valentina, Stefano, and Sara about MODS metadata in CGSpace
|
||||
- They have experience with improving the MODS interface in MELSpace's OAI-PMH for use with AGRIS and were curious if we could do the same in CGSpace
|
||||
- From what I can see we need to upgrade the MODS schema from 3.1 to 3.7 and then just add a bunch of our fields to the crosswalk
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2023-06-04
|
||||
|
||||
- Upgrade CGSpace to Ubuntu 22.04
|
||||
- The upgrade was mostly normal, but I had to unhold the openjdk package in order for `do-release-upgrade` to run:
|
||||
|
||||
```console
|
||||
# apt-mark unhold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
|
||||
```
|
||||
|
||||
- In [2022-11]({{< relref "2022-11.md" >}}) an upstream Java update broke the DSpace 6 Handle server so we will have to pin this again after the upgrade to Ubuntu 22.04
|
||||
- After the upgrade I made sure CGSpace was working, then proceeded to upgrade PostgreSQL from 12 to 14, like I did on [DSpace Test in 2023-03]({{< relref "2023-03.md" >}})
|
||||
- Then I had to downgrade OpenJDK to fix the Handle server using the ones I had previously downloaded for Ubuntu 20.04 because they no longer exist on Launchpad:
|
||||
|
||||
```console
|
||||
# dpkg -i openjdk-8-j*8u342-b07*.deb
|
||||
```
|
||||
|
||||
- Export CGSpace to fix missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
- Work on the DSpace 7 migration a bit more
|
||||
- I decided to rebase and drop all the submission form edits because they conflict every time upstream changes!
|
||||
|
||||
## 2023-06-06
|
||||
|
||||
- Fix some incorrect ORCID identifiers for an Alliance author on CGSpace
|
||||
- Export our list of ORCID identifiers, resolve them, and update the records in CGSpace:
|
||||
|
||||
```console
|
||||
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-06-06-orcids.txt
|
||||
$ ./ilri/resolve_orcids.py -i /tmp/2023-06-06-orcids.txt -o /tmp/2023-06-06-orcids-names.txt -d
|
||||
$ ./ilri/update_orcids.py -i /tmp/2023-06-06-orcids-names.txt -db dspacetest -u dspace -p 'ffff' -m 247
|
||||
```
|
||||
|
||||
- Start working on updating the MODS schema in CGSpace from 3.1 to 3.8 based on Stefano and Salem's work last year
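The `grep -oE ... | sort -u` step used above to collect ORCID identifiers can be sketched in Python as follows. The regular expression is the one from the shell command; the function name `extract_orcids` and the sample strings are hypothetical.

```python
import re

# Same pattern as the grep -oE expression: four groups of four characters
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(*texts):
    """Return the sorted, de-duplicated ORCID identifiers found in the texts."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


# Hypothetical input resembling a vocabulary file and a CSV
vocabulary = '<node id="Jane Doe: 0000-0002-1825-0097" label="Jane Doe: 0000-0002-1825-0097"/>'
extra_csv = "Jane Doe: 0000-0002-1825-0097\nJohn Doe: 0000-0001-5109-3700\n"
print(extract_orcids(vocabulary, extra_csv))
```

The pattern only checks the shape of the identifier, not its checksum, so resolving the identifiers afterwards remains necessary.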
|
||||
|
||||
## 2023-06-08
|
||||
|
||||
- Continue working on the MODS schema mapping
|
||||
- Export CGSpace to check and update `dcterms.extent` fields
|
||||
- I normalized about 1,500 to use either "p. 1-6" or "5 p." format
|
||||
- Also, I used this GREL expression to extract missing pages from the citation field: `cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*(pp?\.\s?\d+[-–]\d+).*/)[0]`
|
||||
- This was over 4,000 items with a format like "p. 1-6" and "pp. 1-6" in the citation
|
||||
- I used another GREL expression to extract another 5,000: `cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*?(\d+\s+?[Pp]+\.).*/)[0]`
|
||||
- This was for the format like "1 p." (note we had to protect against the greedy `.*` in the beginning)
|
||||
- I also did some work to capture a handful of missing DOIs and ISSNs, but it was only about 100 items and I will have to wait until the 10,000+ above finish importing
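The two GREL expressions above can be approximated in Python to extract and normalize extents from a citation. This is a sketch under assumptions: the function name `extract_extent` and the sample citations are hypothetical, and `re.search` returns the leftmost match, whereas the first GREL pattern's greedy leading `.*` favors the last one.

```python
import re

# "p. 1-6" or "pp. 1–6" (hyphen or en dash between the page numbers)
RANGE_RE = re.compile(r"pp?\.\s?(\d+)[-–](\d+)")
# "12 p." or "12 P."
COUNT_RE = re.compile(r"(\d+)\s+[Pp]+\.")


def extract_extent(citation):
    """Return a normalized extent ("p. 1-6" or "12 p.") or None."""
    match = RANGE_RE.search(citation)
    if match:
        # Normalize "pp." to "p." and the en dash to a regular hyphen
        return f"p. {match.group(1)}-{match.group(2)}"
    match = COUNT_RE.search(citation)
    if match:
        return f"{match.group(1)} p."
    return None


print(extract_extent("Doe, J. 2020. A title. Some Journal 5(2): pp. 1–6."))
print(extract_extent("Doe, J. 2019. A report. Nairobi, Kenya: ILRI. 12 p."))
```

Checking the page range first means a citation containing both forms is treated as a range.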
|
||||
|
||||
## 2023-06-09
|
||||
|
||||
- I see there are ~200 users in CGSpace that have registered with their CGIAR email address using a password as opposed to using Active Directory:
|
||||
|
||||
```sql
|
||||
SELECT * FROM eperson WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL;
|
||||
```
|
||||
|
||||
- I am wondering if I should delete their passwords and tell them to log in using LDAP
|
||||
- As an initial test I will reset a few accounts including my own that have passwords and salts:
|
||||
|
||||
```sql
|
||||
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE netid IN ('axxxx', 'axxxx', 'bxxxx');
|
||||
```
|
||||
|
||||
- I also decided to reset passwords/salts for CGIAR accounts that have not been active since 2021 (1.5 years ago):
|
||||
|
||||
```sql
|
||||
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL AND salt IS NOT NULL AND last_active < '2022-01-01'::date;
|
||||
```
|
||||
|
||||
- This was about 100 accounts...
|
||||
- I will wait some more time before I decide what to do about the more current ones
|
||||
- Add a few more ORCID identifiers to my list and tag them on CGSpace
|
||||
|
||||
## 2023-06-10
|
||||
|
||||
- Export CGSpace to check for missing Initiative mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-06-11
|
||||
|
||||
- File [an issue](https://github.com/DSpace/DSpace/issues/8900) on DSpace for the `Content-Disposition` bug causing images to get downloaded instead of opened inline
|
||||
|
||||
## 2023-06-12
|
||||
|
||||
- Export CGSpace to do some more work extracting volume and issue from citations for items where they are missing
|
||||
- I found and fixed over 7,000!
|
||||
- Then I found and extracted another 7,000 items with no extents (pages)
|
||||
- Then I replaced all occurrences of en dashes for ranges in pages with regular hyphens
|
||||
|
||||
## 2023-06-13
|
||||
|
||||
- Last night I finally figured out how to do basic overrides to the simple item view in Angular
|
||||
- Add a handful of new ORCID identifiers to my list and tag them on CGSpace
|
||||
- Extract a list of all the proposed actions for CG Core output types and create a [new issue for them on CG Core's GitHub repository](https://github.com/AgriculturalSemantics/cg-core/issues/45)
|
||||
- Extract a list of all the proposed actions for CG Core output types for MARLO and create [a new issue for them on MARLO's GitHub repository](https://github.com/CCAFS/MARLO/issues/2479)
|
||||
- Meeting with Indira, Ryan, and Abenet to discuss plans for the DSpace 7 focus group
|
||||
|
||||
## 2023-06-14
|
||||
|
||||
- Did some more work on the DSpace 7 Test to improve the submission forms and the look and feel
|
||||
- Extract a list of all the proposed actions for CG Core output types for MEL and create [a new issue for them on MEL's GitHub repository](https://github.com/CodeObia/MEL/issues/11216)
|
||||
- I filed [an issue about the yarn merge-i18n script](https://github.com/DSpace/dspace-angular/issues/2309)
|
||||
- I made [a pull request for some Finnish language i18n strings](https://github.com/DSpace/dspace-angular/pull/2306)
|
||||
- I made [a pull request to lint the i18n en.json5 file](https://github.com/DSpace/dspace-angular/pull/2306)
|
||||
|
||||
## 2023-06-15
|
||||
|
||||
- A lot more work on DSpace 7
|
||||
- I tested some pull requests and worked on the style of the item view and homepage
|
||||
|
||||
## 2023-06-16
|
||||
|
||||
- A lot more work on DSpace 7
|
||||
- I made [a pull request to adjust font weight in item counts](https://github.com/DSpace/dspace-angular/pull/2316)
|
||||
- I made [a pull request to update the ESLint configuration for JSON5](https://github.com/DSpace/dspace-angular/pull/2317)
|
||||
|
||||
## 2023-06-17
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- I also spent some time doing sanity checks on countries, regions, DOIs, and more
|
||||
- I lowercased all our AGROVOC keywords in `dcterms.subject`:
|
||||
|
||||
```sql
|
||||
dspace=# BEGIN;
|
||||
BEGIN
|
||||
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
|
||||
UPDATE 2392
|
||||
dspace=*# COMMIT;
|
||||
COMMIT
|
||||
```
|
||||
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-06-19
|
||||
|
||||
- Today I started getting an error on DSpace 7 Test
|
||||
- The page loads, and then when it is almost done it goes blank (white) with this in the console:
|
||||
|
||||
```console
|
||||
ERROR DOMException: CSSStyleSheet.cssRules getter: Not allowed to access cross-origin stylesheet
|
||||
```
|
||||
|
||||
- I restarted Angular, but it didn't fix it
|
||||
- The `yarn test:rest` script shows everything OK, and I haven't changed anything recently...
|
||||
- I re-compiled the Angular UI using the default theme and it was the same...
|
||||
- I tried in Firefox Nightly and it works...
|
||||
- So it must be something related to the browser
|
||||
- I tried clearing all the session storage / cookies and refreshing and it worked
|
||||
- I switched back to the CGSpace theme and it happened again
|
||||
- I had a hunch it might be due to the GDPR cookie plugin in my browser, so I disabled that and then refreshed and it worked... hmmm
|
||||
- Upload thumbnails for about 42 IITA Journal Articles after resolving their DOIs and making sure they were not CC ND
|
||||
- I fixed a few bugs in `get_scihub_pdfs.py` in the process
|
||||
|
||||
## 2023-06-21
|
||||
|
||||
- Stefano got back to me about the MODS OAI-PMH schema test on DSpace Test
|
||||
- He said that it's fine if we use iso8601 encoding for dates instead of w3cdtf and asked if we can create a custom end point for AGRIS that only includes types like Journal Articles similar to how Salem did it: https://melspace.loc.codeobia.com/oai/agris?verb=ListRecords&metadataPrefix=mods
|
||||
- I updated DSpace Test with the new date format and said I'd work on the custom AGRIS set
|
||||
|
||||
## 2023-06-25
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- I wanted to start a harvest on AReS but I've seen the load on the server high for a few days and I'm not sure what it is
|
||||
- I decided to run all updates and reboot it since it's Sunday anyway
|
||||
|
||||
## 2023-06-26
|
||||
|
||||
- Since the new DSpace 7 will respect newlines in metadata fields I am curious to see how many of our abstracts have poor newlines
|
||||
- I exported CGSpace and used a custom text facet with this GREL expression in OpenRefine to count the number of newlines in each cell:
|
||||
|
||||
```console
|
||||
value.split('\n').length()
|
||||
```
|
||||
|
||||
- Also useful to check for general length of the text in the cell to make sure it's a reasonably long string
|
||||
- I spent some time trying to find a pattern that I could use to identify "easy" targets, but there are so many exceptions that it will have to be done manually
|
||||
- I fixed a few dozen
|
||||
- Do a bit of work on thumbnails on CGSpace
|
||||
- I'm trying to troubleshoot the Discovery error I get on DSpace 7:
|
||||
|
||||
```console
|
||||
java.lang.NullPointerException: Cannot invoke "org.dspace.discovery.configuration.DiscoverySearchFilterFacet.getIndexFieldName()" because the return value of "org.dspace.content.authority.DSpaceControlledVocabularyIndex.getFacetConfig()" is null
|
||||
```
|
||||
|
||||
- I reverted to the default `submission-forms.xml` and the `getFacetConfig()` error goes away...
|
||||
- Kill some long-held locks on CGSpace PostgreSQL, as some users are complaining of slowness in archiving
|
||||
- I did some testing of the LDAP login issue related to groupmaps
|
||||
- It does seem to be a regression from the [LDAP auth patch](https://github.com/DSpace/DSpace/pull/8814) from last month, so I [filed an issue](https://github.com/DSpace/DSpace/issues/8920)
|
||||
- I spent some time working on Angular and I figured out how to add a custom Angular component to show the UN SDG Goal icons on DSpace 7
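The newline check on abstracts described above (the GREL expression `value.split('\n').length()`) has a direct Python equivalent. A small sketch, where the function names and the `min_length` threshold are hypothetical choices for illustration:

```python
def line_count(value):
    """Number of newline-separated lines, like value.split('\\n').length()."""
    return len(value.split("\n"))


def needs_review(value, min_length=200):
    """Flag multi-line values that are long enough to be worth checking.

    min_length is an arbitrary cutoff to skip short strings.
    """
    return line_count(value) > 1 and len(value) >= min_length


abstract = "First sentence of an abstract\nthat was wrapped in the middle."
print(line_count(abstract), needs_review(abstract))
```

This only narrows down candidates; as noted above, deciding which newlines are actually wrong still has to be done manually.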
|
||||
|
||||
## 2023-06-27
|
||||
|
||||
- I debugged the NullPointerException and somehow it disappeared
|
||||
- It seems to be related to the external controlled vocabularies in the submission form
|
||||
- I removed them all, then added them all back, and now the issue is solved... hmmmm
|
||||
- Oh no, now they are gone again, sigh...
|
||||
|
||||
## 2023-06-28
|
||||
|
||||
- Spent a lot of time debugging the browse indexes
|
||||
- Looking at the [DSpace 7 demo API](https://api7.dspace.org/server/api/discover/browses) I see the four default browse indexes from `dspace.cfg` and the one default `srsc` one that gets automatically enabled from the `<vocabulary>srsc</vocabulary>` in the `submission-forms.xml`
|
||||
- The same API call on my test DSpace 7 configuration results in the HTTP 500 I've been seeing for some time, and I am pretty sure it's due to the automagic configuration of hierarchical browses based on the submission form
|
||||
- Yes, if I remove them all from my submission form then this works: http://localhost:8080/server/api/discover/browses
|
||||
- I went through each of our vocabularies and tested them one by one:
|
||||
- dcterms-subject: OK
|
||||
- dc-contributor-author: NO
|
||||
- cg-creator-identifier: NO
|
||||
- cg-contributor-affiliation: OK (and with `facetType: "affiliation"` in API response?!)
|
||||
- cg-contributor-donor: OK (`facetType: "sponsorship"`)
|
||||
- cg-journal: NO
|
||||
- cg-coverage-subregion: NO
|
||||
- cg-species-breed: NO
|
||||
- Now I need to figure out what it is about those five that causes them to not work!
|
||||
- Ah, after debugging with someone on the DSpace Slack, I realized that DSpace expects these vocabularies to have corresponding indexes configured in `discovery.xml`, and they must be added as search filters AND sidebar facets.
|
||||
|
||||
## 2023-06-29
|
||||
|
||||
- I noticed there is now a [patched version of the Handle JAR for DSpace 6.x](https://github.com/DSpace/DSpace/issues/8557#issuecomment-1595340249)
|
||||
- This fixes the [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1), so we can remove the apt pin on JDK now
|
||||
- I deployed it on CGSpace and it's working!
|
||||
- I lowercased all our AGROVOC terms because I noticed a few that were not:
|
||||
|
||||
```console
|
||||
dspace=# BEGIN;
|
||||
BEGIN
|
||||
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
|
||||
UPDATE 53
|
||||
dspace=*# COMMIT;
|
||||
```
|
||||
|
||||
- After more discussion about the NullPointerException related to browse options, I filed [an issue](https://github.com/DSpace/DSpace/issues/8927)
|
||||
|
||||
## 2023-06-30
|
||||
|
||||
- I added another custom component to display CGIAR Impact Area icons in the DSpace 7 test
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
324
content/posts/2023-07.md
Normal file
@@ -0,0 +1,324 @@
|
||||
---
|
||||
title: "July, 2023"
|
||||
date: 2023-07-01T17:14:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-07-01
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start harvesting on AReS
|
||||
|
||||
## 2023-07-02
|
||||
|
||||
- Minor edits to the `crossref_doi_lookup.py` script while running some checks on 22,000 CGSpace DOIs
|
||||
|
||||
## 2023-07-03
|
||||
|
||||
- I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
|
||||
- I took the more accurate ones from Crossref and updated the items on CGSpace
|
||||
- I took a few hundred ISBNs as well for where we were missing them
|
||||
- I also tagged ~4,700 items with missing licenses as "Copyrighted; all rights reserved" based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer
|
||||
- Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it's usually copyrighted (could still be open access, but we can't tell via Crossref)
|
||||
- I would be curious to write a script to check the Unpaywall API for open access status...
|
||||
- In the past I found that their *license* status was not very accurate, but the open access status might be more reliable
|
||||
- More minor work on the DSpace 7 item views
|
||||
- I learned some new Angular template syntax
|
||||
- I created a custom component to show Creative Commons licenses on the simple item page
|
||||
- I also decided that I don't like the Impact Area icons as a component because they don't have any visual meaning
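The Crossref license rule described above (a Creative Commons license wins; a record with only TDM licenses is treated as copyrighted) can be sketched as a small function. It assumes license entries shaped like the `license` array in Crossref's REST API, with `URL` and `content-version` keys; the function name is hypothetical and the label string is the one used in these notes.

```python
def classify_crossref_licenses(licenses):
    """Classify a list of Crossref license entries.

    Returns a Creative Commons URL if one is present, the copyrighted
    label if every entry is a TDM license, and None otherwise.
    """
    for entry in licenses:
        url = entry.get("URL", "")
        if "creativecommons.org" in url:
            return url
    if licenses and all(e.get("content-version") == "tdm" for e in licenses):
        return "Copyrighted; all rights reserved"
    # Nothing conclusive: leave the item for manual review
    return None


tdm_only = [{"URL": "https://www.elsevier.com/tdm/userlicense/1.0/", "content-version": "tdm"}]
print(classify_crossref_licenses(tdm_only))
```

Returning `None` for anything else keeps ambiguous records out of the bulk update, since a TDM-only record can still be open access.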
|
||||
|
||||
## 2023-07-04
|
||||
|
||||
- Focus group meeting with CGSpace partners about DSpace 7
|
||||
- I added a themed file selection component to the CGSpace theme
|
||||
- It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI
|
||||
- I added a custom component to show share icons
|
||||
|
||||
## 2023-07-05
|
||||
|
||||
- I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13
|
||||
- Most things work but there are some minor bugs it seems
|
||||
- Mishell from CIP emailed me to say she was having problems approving an item on CGSpace
|
||||
- Looking at PostgreSQL I saw there were a dozen or so locks that were several hours and even over one day old so I killed those processes and told her to try again
|
||||
|
||||
## 2023-07-06
|
||||
|
||||
- Types meeting
|
||||
- I wrote a Python script to check Unpaywall for some information about DOIs
|
||||
|
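A script like the one mentioned above could look roughly like the sketch below. This is not the actual script, only a minimal illustration that assumes the public Unpaywall v2 endpoint and its `is_oa`, `oa_status`, and `best_oa_location` response fields; the function names and the email address are made up for the example.

```python
import json
import urllib.parse
import urllib.request


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall v2 lookup URL for a DOI (email is a required parameter)."""
    return (
        "https://api.unpaywall.org/v2/"
        + urllib.parse.quote(doi, safe="/")
        + "?"
        + urllib.parse.urlencode({"email": email})
    )


def summarize(record: dict) -> dict:
    """Reduce an Unpaywall record to the open access fields of interest."""
    # best_oa_location is null for closed items, so fall back to an empty dict
    location = record.get("best_oa_location") or {}
    return {
        "doi": record.get("doi"),
        "is_oa": bool(record.get("is_oa")),
        "oa_status": record.get("oa_status"),
        "host_type": location.get("host_type"),
        "version": location.get("version"),
        "license": location.get("license"),
    }


def lookup(doi: str, email: str) -> dict:
    """Fetch and summarize one DOI (this performs a network request)."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=30) as response:
        return summarize(json.load(response))
```

Calling `lookup("10.1234/example", "you@example.org")` (both values are placeholders) would return a small dict that is easy to write out as a CSV row per DOI.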
||||
## 2023-07-07
|
||||
|
||||
- Continue exploring Unpaywall data for some of our DOIs
|
||||
- In the past I've found their _licensing_ information to not be very reliable (preferring Crossref), but I think their _open access status_ is more reliable, especially when the provider is listed as being the publisher
|
||||
- Even so, sometimes the version can be "acceptedVersion", which is presumably the author's version, as opposed to the "publishedVersion", which means it's available as open access on the publisher's website
|
||||
- I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses
|
||||
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
|
||||
- Start working on some statistics on AGROVOC usage for my presentation next week
|
||||
- I used the following SQL query to dump values from all subject fields and lower case them:
|
||||
|
||||
```console
|
||||
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2023-07-07-cgspace-subjects.csv WITH CSV HEADER;
|
||||
COPY 26443
|
||||
Time: 2564.851 ms (00:02.565)
|
||||
```
|
||||
|
||||
- Then I extracted the subjects and looked them up against AGROVOC:
|
||||
|
||||
```console
|
||||
$ csvcut -c subject /tmp/2023-07-07-cgspace-subjects.csv | sed '1d' > /tmp/2023-07-07-cgspace-subjects.txt
|
||||
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-07-07-cgspace-subjects.txt -o /tmp/2023-07-07-cgspace-subjects-results.csv
|
||||
```
|
||||
|
||||
- I did some more tests with Angular 13 on OpenRXV and found out why the repository type dropdown wasn't working
|
||||
- It was because of a missing 1-line JSON file in the data directory, which is runtime data, not code
|
||||
  - I copied the data directory from the production server and rebuilt, and the site is working well now
|
||||
- I did a full harvest with plugins and it worked!
|
||||
- So it seems Angular 13.4.0 will work, yay
|
||||
|
||||
## 2023-07-08
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
- The AGROVOC lookup finished, so I checked the number of matches:
|
||||
|
||||
```console
|
||||
$ csvgrep -c 'match type' -r '^.+$' ~/Downloads/2023-07-07-cgspace-subjects-resolved.csv | sed 1d | wc -l
|
||||
12528
|
||||
```
|
||||
|
||||
- So that's 12,528 out of 26,443 unique terms (47.4%)
|
||||
- I did a LOT of work on the OpenRXV frontend build dependencies to bring them more in line with Angular 13
|
||||
|
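The match count above can also be computed without csvkit. The following is a small sketch that assumes the results CSV has a `match type` column that is empty for unmatched terms, as the `csvgrep` call suggests; the function names are hypothetical.

```python
import csv


def count_matches(csv_file, column="match type"):
    """Return (matched, total), where matched rows have a non-empty value in column."""
    matched = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if row.get(column):
            matched += 1
    return matched, total


def percent(matched, total):
    """Share of matched terms as a percentage rounded to one decimal place."""
    return round(100 * matched / total, 1) if total else 0.0
```

To use it, open the results file with `newline=""` and pass the file object to `count_matches()`, then feed both numbers to `percent()`.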
||||
## 2023-07-10
|
||||
|
||||
- I did a lot more work on OpenRXV to test and update dependencies
|
||||
- I deployed the latest version on the production server
|
||||
|
||||
## 2023-07-12
|
||||
|
||||
- CGSpace upgrade meeting with Americas and Africa group
|
||||
|
||||
## 2023-07-13
|
||||
|
||||
- Michael Victor asked me to help Aditi extract some information from CGSpace
|
||||
- She was interested in journal articles published between 2018 and 2023 with a range of subjects related to drought, flooding, resilience, etc
|
||||
- I used an advanced query with some AGROVOC terms:
|
||||
|
||||
```console
|
||||
dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration" OR dcterms.subject:livestock)
|
||||
```
|
||||
|
||||
- Interestingly, some variations of this same exact query produce no search results, and I see this error in the DSpace log:
|
||||
|
||||
```console
|
||||
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:livestock OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration\"\)': Lexical error at line 1, column 617. Encountered: <EOF> after : "\"landscape restoration\\\"\\)"
|
||||
```
|
||||
|
||||
- It seems to be when there is a quoted search term at the end of the parenthesized group
|
||||
- For what it's worth this same query worked fine on DSpace 7.6
|
||||
|
||||
## 2023-07-15
|
||||
|
||||
- Export CGSpace to fix missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-07-17
|
||||
|
||||
- Rasika had sent me a list of new ORCID identifiers for new IWMI staff so I combined them with our existing list and ran `resolve_orcids.py` to refresh the names in our database
|
||||
- I updated the list, updated names in the database, and tagged new authors with missing identifiers in existing items
|
||||
|
||||
## 2023-07-18
|
||||
|
||||
- Meeting with IWMI, IRRI, and IITA colleagues about CGSpace upgrade plans
|
||||
- Maria from the Alliance mentioned having some submissions stuck on CGSpace
|
||||
  - I looked and found a number of locks stuck for nineteen, eighteen, and more hours...
|
||||
- I killed them and told her to try again
|
||||
|
||||
```console
|
||||
$ psql < locks-age.sql | less -S
|
||||
$ psql < locks-age.sql | grep -E " (19|18|17|16|12):" | awk -F"|" '{print $10}' | sort -u | xargs kill
|
||||
```
|
||||
|
||||
## 2023-07-19
|
||||
|
||||
- I had to kill a bunch more locked processes in PostgreSQL, I'm not sure what's going on
|
||||
- After some discussion about an advanced search bug with Tim on Slack, I filed [an issue on GitHub](https://github.com/DSpace/DSpace/issues/8962)
|
||||
|
||||
## 2023-07-20
|
||||
|
||||
- I added a new metadata field for CGIAR Impact Platforms (`cg.subject.impactPlatform`) to CGSpace
|
||||
|
||||
## 2023-07-22
|
||||
|
||||
- Export CGSpace to fix missing Initiative collections
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-07-24
|
||||
|
||||
- Test Salem's new JavaScript-based DSpace Statistics API and send him some feedback
|
||||
- I noticed a few times that the Solr service on my DSpace 7 instance is getting OOM killed
|
||||
- I had been using a 4g Solr heap, but maybe we don't need that much
|
||||
- Tomcat is also using 4.6GB, and then there's PostgreSQL... so perhaps it's all a bit much on this system now
|
||||
|
||||
## 2023-07-25
|
||||
|
||||
- Start testing exporting DSpace 6 Solr cores to import on DSpace 7:
|
||||
|
||||
```console
|
||||
$ chrt -b 0 dspace solr-export-statistics -i statistics
|
||||
```
|
||||
|
||||
- I'm curious how long it takes and how much data there will be
|
||||
- The size of the Solr data directory is currently 82GB
|
||||
- The export took about 2.5 hours and created 6,000 individual CSVs, one for each day of Solr stats
|
||||
- The size of the exported CSVs is about 88GB
|
||||
- I will copy just a few years to import on the DSpace 7 test server
|
||||
- So importing these is going to require removing the Atmire custom fields:
|
||||
|
||||
```console
|
||||
$ dspace solr-import-statistics -i statistics
|
||||
Exception: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
|
||||
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
|
||||
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
|
||||
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
|
||||
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
|
||||
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
|
||||
at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
|
||||
at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
|
||||
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
||||
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
|
||||
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
||||
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
|
||||
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
|
||||
at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
|
||||
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)
|
||||
```
|
||||
|
||||
- I will try using solr-import-export-json, which I've used in the past to skip Atmire custom fields in Solr:
|
||||
|
||||
```console
|
||||
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2022.json -f 'time:[2022-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,geoIpCountryCode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId,core_update_run_nb
|
||||
```
|
||||
|
||||
- Some users complained that CGSpace was slow and I found a handful of locks that were hours and days old...
|
||||
- I killed those and told them to try again
|
||||
- After importing the Solr statistics into DSpace 7 I realized that my DSpace Statistics API will work fine
|
||||
- I made some minor modifications to the Ansible infrastructure scripts to make sure it is enabled and then activated it on DSpace 7 Test
|
||||
|
||||
## 2023-07-26
|
||||
|
||||
- Debugging lock issues on CGSpace
|
||||
- I see the blocking PIDs for some long-held locks are "idle in transaction":
|
||||
|
||||
```console
|
||||
$ ps auxw | grep -E "(1864132|1659487)"
|
||||
postgres 1659487 0.0 0.5 3269900 197120 ? Ss Jul25 0:03 postgres: 14/main: cgspace cgspace 127.0.0.1(61648) idle in transaction
|
||||
postgres 1864132 0.1 0.7 3275704 254528 ? Ss 07:27 0:08 postgres: 14/main: cgspace cgspace 127.0.0.1(36998) idle in transaction
|
||||
postgres 1880388 0.0 0.0 9208 2432 pts/3 S+ 08:48 0:00 grep -E (1864132|1659487)
|
||||
```
|
||||
|
||||
- I used some other scripts and found that those processes were executing the following statement:
|
||||
|
||||
```console
|
||||
select nextval ('public.tasklistitem_seq')
|
||||
```
|
||||
|
||||
- I don't know why these can get blocked for hours without resolution, but for now I just killed them
|
||||
- For what it's worth [these sequences were removed in DSpace 7.0](https://github.com/DSpace/DSpace/commit/16ae96b4c3d833c2a4acd1f05985d424c3a52bd7) along with the "traditional" item workflow—maybe that means we won't have such contention issues in DSpace 7!
|
||||
- I wrote a slightly longer regex to match locks that have been stuck for more than 1 hour based on the output of the `locks-age.sql` script and killed them:
|
||||
|
||||
```console
|
||||
$ psql < locks-age.sql | awk -F"|" '/ [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
|
||||
```
|
||||
|
||||
- I filed [an issue for missing Altmetric badges on DSpace 7 Angular](https://github.com/DSpace/dspace-angular/issues/2400)
|
||||
|
||||
## 2023-07-27
|
||||
|
||||
- Export CGSpace to check countries, regions, types, and Initiatives
|
||||
- There were a few minor issues in countries and regions, and I noticed 186 items without types!
|
||||
- Then I ran the file through csv-metadata-quality to make sure items with countries have appropriate regions
|
||||
- Brief discussion about OpenRXV bugs and fixes with Moayad
|
||||
- I was toying with the idea of using an expanded whitespace check/fix based on [ESLint's no-irregular-whitespace](https://eslint.org/docs/latest/rules/no-irregular-whitespace) rule in csv-metadata-quality
|
||||
- I found 176 items in CGSpace with such whitespace in their titles alone
|
||||
- I compared the results of removing these characters and replacing them with a space
|
||||
- In _most_ cases removing it is the correct thing to do, for example "Pesticides : une arme à double tranchant" → "Pesticides: une arme à double tranchant"
|
||||
- But in some items it is tricky, for example "L'environnement juridique est-il propice à la gestion" → "L'environnement juridique est-il propice àla gestion"
|
||||
- I guess it would really need some good heuristics or a human to verify...
|
||||
- I upgraded OpenRXV to Angular v14
|
||||
|
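To compare the two strategies for irregular whitespace, a check along these lines would be enough. It is only a sketch: the character list is my reading of ESLint's `no-irregular-whitespace` rule, the function names are hypothetical, and the sample titles assume the offending character is a non-breaking space (U+00A0).

```python
import re

# Characters treated as irregular whitespace (based on ESLint's rule; verify before relying on it)
IRREGULAR_WHITESPACE = re.compile(
    "[\u000b\u000c\u0085\u00a0\u1680\u180e\u2000-\u200b\u2028\u2029\u202f\u205f\u3000\ufeff]"
)


def find_irregular(text):
    """Return (offset, code point) pairs for each irregular whitespace character."""
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in IRREGULAR_WHITESPACE.finditer(text)
    ]


def candidates(text):
    """Return both possible fixes so a human (or a heuristic) can pick one."""
    removed = IRREGULAR_WHITESPACE.sub("", text)
    # Replace with a normal space, then collapse any doubled spaces this creates
    spaced = re.sub(" {2,}", " ", IRREGULAR_WHITESPACE.sub(" ", text))
    return {"removed": removed, "spaced": spaced}
```

Printing both candidates side by side for each affected title makes it quick to review the cases where removal would glue two words together.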
||||
## 2023-07-28
|
||||
|
||||
- After a bit more testing I merged the [Angular v14 changes to OpenRXV master](https://github.com/ilri/OpenRXV/pull/184)
|
||||
- I am getting an error trying to import the 2020 Solr statistics from CGSpace to DSpace 7:
|
||||
|
||||
```console
|
||||
Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=0008a7c1-e552-4a4e-93e4-4d23bf39964b] Error adding field 'workflowItemId'='0812be47-1bfe-45e2-9208-5bf10ee46f81' msg=For input string: "0812be47-1bfe-45e2-9208-5bf10ee46f81"
|
||||
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:745)
|
||||
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
|
||||
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
|
||||
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:234)
|
||||
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:102)
|
||||
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:69)
|
||||
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:82)
|
||||
at it.damore.solr.importexport.App.insertBatch(App.java:295)
|
||||
at it.damore.solr.importexport.App.lambda$writeAllDocuments$10(App.java:276)
|
||||
at it.damore.solr.importexport.BatchCollector.lambda$accumulator$0(BatchCollector.java:71)
|
||||
at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
|
||||
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
|
||||
at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
|
||||
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
|
||||
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
|
||||
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
|
||||
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
|
||||
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
|
||||
at it.damore.solr.importexport.App.writeAllDocuments(App.java:252)
|
||||
at it.damore.solr.importexport.App.main(App.java:150)
|
||||
```
|
||||
|
||||
- Ahhhh, in DSpace 6 this field was a string in the Solr statistics schema, but in DSpace 7 it is an integer...?
|
||||
- Oh, it seems to be an Atmire change in our DSpace 6... hmmm, so we need to ignore the `workflowItemId` field when exporting
|
||||
- Upstream: https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/solr/statistics/conf/schema.xml#L328
|
||||
- ILRI: https://github.com/ilri/DSpace/blob/6_x-prod/dspace/solr/statistics/conf/schema.xml#L344
|
||||
- I am wondering if we can skip all these workflow fields since I don't think we are using any aspects of statistics related to workflows
|
||||
- I diffed our Solr statistics schema with the one from vanilla DSpace 6 and got a list of all the fields that were different:
|
||||
|
||||
```
|
||||
isInternal,workflowItemId,containerCommunity,containerCollection,containerItem,containerBitstream,dateYear,dateYearMonth,filterquery,complete_query,simple_query,complete_query_search,simple_query_search,ngram_query_search,ngram_simplequery_search,text,storage_statistics_type,storage_size,storage_nb_of_bitstreams,name,first_name,last_name,p_communities_id,p_communities_name,p_communities_map,p_group_id,p_group_name,p_group_map,group_id,group_name,group_map,parent_count,bitstreamId,bitstreamCount,actingGroupId,actorMemberGroupId,actingGroupParentId,rangeDescription,range,version_id,file_id,cua_version,core_update_run_nb,orphaned
|
||||
```
|
||||
|
||||
- I will combine it with the other fields I was skipping above and try the export again:
|
||||
|
||||
```console
|
||||
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020.json -f 'time:[2020-01-01T00\:00\:00Z TO 2020-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
|
||||
```
|
||||
|
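Combining the field lists by hand is error prone with this many names, so a tiny helper can do the merge. This is a sketch with a hypothetical function name; it simply de-duplicates and sorts the comma-separated lists so the result can be passed to the `-S` option.

```python
def merge_fields(*lists):
    """Merge comma-separated field lists into one sorted, de-duplicated list."""
    fields = set()
    for value in lists:
        fields.update(name.strip() for name in value.split(",") if name.strip())
    # Sort case-insensitively, using the exact name as a tie breaker for stable output
    return ",".join(sorted(fields, key=lambda name: (name.lower(), name)))
```

For example, `merge_fields(schema_diff, previously_skipped)` returns a single string, where both arguments are the comma-separated lists shown above.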
||||
- Export a list of affiliations from the Initiatives community for Peter:
|
||||
|
||||
```console
|
||||
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-07-28-initiatives.csv
|
||||
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-07-28-initiatives.csv \
|
||||
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
|
||||
| sort | uniq -c | sort -hr \
|
||||
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
|
||||
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
|
||||
> /tmp/2023-07-28-initiatives-affiliations.csv
|
||||
```
|
||||
|
||||
- This is a method I first used in 2023-01 to export ONLY the affiliations used in items in the Initiatives community
|
||||
- I did the same for authors and investors
|
||||
|
||||
## 2023-07-29
|
||||
|
||||
- Export CGSpace to look for missing Initiative collection mappings
|
||||
- I found a bunch of locks waiting for many hours and killed them:
|
||||
|
||||
```console
|
||||
$ psql < locks-age.sql | awk -F"|" '$9 ~ / [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
|
||||
```
|
||||
|
||||
- This looks for a pattern matching something like `11:30:48.598436` in the age column (not 00:00:00) and kills them
|
||||
- Start a harvest on AReS
|
||||
|
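The age matching could also be done by parsing the interval instead of pattern matching on it. The sketch below assumes the same layout the `awk` command relies on (pipe-separated `psql` output with the age in the ninth column and the PID in the tenth) and the default PostgreSQL interval format, including a `1 day` style prefix for older locks; the function names are hypothetical. It only returns PIDs and does not kill anything. One thing it avoids is that `[[:digit:]][1-9]` skips hour values ending in zero, such as 10 or 20.

```python
import re
from datetime import timedelta

# Matches e.g. "11:30:48.598436" or "1 day 02:03:04.5"
AGE = re.compile(
    r"^\s*(?:(?P<days>\d+) days? )?(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})(?:\.\d+)?\s*$"
)


def parse_age(text):
    """Parse a PostgreSQL interval string into a timedelta, or None if it doesn't match."""
    match = AGE.match(text)
    if match is None:
        return None
    return timedelta(
        days=int(match["days"] or 0),
        hours=int(match["h"]),
        minutes=int(match["m"]),
        seconds=int(match["s"]),
    )


def stale_pids(lines, age_column=8, pid_column=9, threshold=timedelta(hours=1)):
    """Return sorted unique PIDs whose lock age is at least the threshold."""
    pids = set()
    for line in lines:
        columns = line.split("|")
        if len(columns) <= max(age_column, pid_column):
            continue  # header separators, row counts, blank lines
        age = parse_age(columns[age_column])
        pid = columns[pid_column].strip()
        if age is not None and age >= threshold and pid.isdigit():
            pids.add(int(pid))
    return sorted(pids)
```

Feeding it the lines of `psql < locks-age.sql` output gives a list that can be reviewed before passing anything to `kill`.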
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
266
content/posts/2023-08.md
Normal file
@@ -0,0 +1,266 @@
|
||||
---
|
||||
title: "August, 2023"
|
||||
date: 2023-08-03T11:18:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-08-03
|
||||
|
||||
- I finally got around to working on Peter's cleanups for affiliations, authors, and donors from last week
|
||||
- I did some minor cleanups myself and applied them to CGSpace
|
||||
- Start working on some batch uploads for IFPRI
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2023-08-04
|
||||
|
||||
- Minor cleanups on IFPRI's batch uploads
|
||||
- I also did a duplicate check and found thirteen items that seem to be duplicates, so I sent them to Leigh to check
|
||||
- I read this [interesting blog post about PostgreSQL's `log_statement` setting](https://www.endpointdev.com/blog/2012/06/logstatement-postgres-all-full-logging/)
|
||||
- Someone pointed out that this also lets you take advantage of [PgBadger](https://github.com/darold/pgbadger) analysis
|
||||
- I enabled statement logging on DSpace Test and I will check it in a few days
|
||||
- Reading about DSpace 7 REST API again
|
||||
- Here is how to get the first page of 100 items: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&page=1&size=100
|
||||
- I really want to benchmark this to see how fast we can get all the pages
|
||||
- Another thing I notice is that the bitstreams are not here, so that will be an extra call...
|
||||
|
||||
## 2023-08-05
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-08-07
|
||||
|
||||
- I'm checking the PostgreSQL logs now that statement logging has been enabled for a few days on DSpace Test
|
||||
- I see the logs are about 7 or 8 GB, which is larger than expected—and this is the test server!
|
||||
- I will now play with pgbadger to see if it gives any useful insights
|
||||
  - Hmm, it seems the `log_statement` advice was old, as pgbadger itself says:
|
||||
|
||||
> Do not enable log_statement as its log format will not be parsed by pgBadger.
|
||||
|
||||
... and:
|
||||
|
||||
> Warning: Do not enable both log_min_duration_statement, log_duration and log_statement all together, this will result in wrong counter values. Note that this will also increase drastically the size of your log. log_min_duration_statement should always be preferred.
|
||||
|
||||
  - So we need to follow pgbadger's instructions instead to get a suitable log file
|
||||
- After enabling the new settings I see that our log file is going to be reaallllly big... hmmmm will check tomorrow morning
|
||||
- More work on the IFPRI batch uploads
|
||||
|
||||
## 2023-08-08
|
||||
|
||||
- Apply more corrections to authors from Peter on CGSpace
|
||||
- I finally figured out a `log_line_prefix` for PostgreSQL that works for pgBadger:
|
||||
|
||||
```console
|
||||
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
|
||||
```
|
||||
|
||||
- Now I can generate reports:
|
||||
|
||||
```console
|
||||
# /usr/bin/pgbadger -I -q /var/log/postgresql/postgresql-14-main.log -O /srv/www/pgbadger
|
||||
```
|
||||
|
||||
- Ideally we would run this incremental report every day on `postgresql-14-main.log.1`, that is, yesterday's log file after it has been rotated
|
||||
- Now I have to see how large the file will be...
|
||||
- I did some final updates to the ninety IFPRI records and uploaded them to DSpace Test first, then to CGSpace
|
||||
|
||||
## 2023-08-11
|
||||
|
||||
- Fix bug with header background on DSpace 7 on mobile
|
||||
|
||||
## 2023-08-12
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- I deployed the latest OpenRXV master branch with Angular v14 and backend updates on the server
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-08-14
|
||||
|
||||
- I ported the DSpace 6.x REST API patch to allow specifying a bundle name when POSTing a bitstream to the legacy REST API in DSpace 7.6
|
||||
|
||||
## 2023-08-16
|
||||
|
||||
- I noticed that the DSpace statistics pages don't seem to work on communities or collections
|
||||
- I finally took time to look in the DSpace log file and found this for one:
|
||||
|
||||
```console
|
||||
2023-08-16 14:30:31,873 WARN dace8f96-f034-488e-b38c-9f2eb5d0e002 6cbd0b18-6852-4294-99a5-02dfcab0a469 org.dspace.app.rest.exception.DSpaceApiExceptionControllerAdvice @ Request is invalid or incorrect (status:400 exception: Invalid UUID string: -1 at: java.base/java.util.UUID.fromString1(UUID.java:280))
|
||||
```
|
||||
|
||||
- I'm surprised to see this because those should have been dealt with when we upgraded to DSpace 6
|
||||
- Looking in the Solr statistics core I see ~1,000,000 documents with the ID `-1`, and about 57,000,000 that don't
|
||||
- Also interesting, faceting by `dateYear` I see:
|
||||
- 2023: 209566
|
||||
- 2022: 403871
|
||||
- 2021: 336548
|
||||
- 2020: 31659
|
||||
- ... none before 2020
|
||||
- They are all type 5, which is "Site" aka the home page, according to `dspace-api/src/main/java/org/dspace/core/Constants.java`
|
||||
- Ah hah, and I can see in my DSpace 7 test Solr there are a bunch of hits with `type: 5` that have "-1" of course, but also newer ones that have an actual UUID
|
||||
- I used the `/server/api/dso/find?uuid=3945ec23-2426-4fce-a2ea-48b38b91547f` endpoint to find out that there is a new `/server/api/core/sites` endpoint listing exactly one site (the home page) with this ID
|
||||
- So for now I can replace all the "-1" documents with this ID on the test server at least, then I will have to remember to do that during the migration of the production instance
|
||||
- I did a new export from DSpace 6 using solr-import-export-json with a query limiting it to documents of type 5 with an ID of "-1":
|
||||
|
||||
```console
|
||||
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-fix-uuid.json -f 'id:\-1 AND type:5 AND time:[2020-01-01T00\:00\:00Z TO 2023-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
|
||||
```
|
||||
|
||||
- Then I replaced the IDs with the UUID of the site homepage on DSpace 7 Test:
|
||||
|
||||
```console
|
||||
$ sed -i 's/"id":"-1"/"id":"3945ec23-2426-4fce-a2ea-48b38b91547f"/' /tmp/statistics-fix-uuid.json
|
||||
```
|
||||
|
||||
- I re-imported those records and I no longer see the "-1" IDs, but still get the same error in the log
|
||||
- I don't understand, maybe there is some voodoo, so I rebooted the server
|
||||
- Hmm, no, it's not a voodoo cache issue, so I really need to debug this:
|
||||
|
||||
```console
|
||||
2023-08-16 15:44:07,122 WARN dace8f96-f034-488e-b38c-9f2eb5d0e002 036b88e6-7548-4852-9646-f345ce3bfcc2 org.dspace.app.rest.exception.DSpaceApiExceptionControllerAdvice @ Request is invalid or incorrect (status:400 exception: Invalid UUID string: -1 at: java.base/java.util.UUID.fromString1(UUID.java:280))
|
||||
```
|
||||
|
||||
- On a related note, I figured out that the root site already has a UUID in DSpace 6, and it's exactly the one above (3945ec23-2426-4fce-a2ea-48b38b91547f)
|
||||
- I noticed it while looking at the [DSpace 6 REST API's hierarchy page](https://cgspace.cgiar.org/rest/hierarchy)
|
||||
- So I can update these "-1" IDs with "type:5" in our production I think...
|
||||
|
||||
## 2023-08-17
|
||||
|
||||
- I decided to update the "-1" IDs in Solr on DSpace 6
|
||||
- Unfortunately, in Solr there is no way to update only documents matching a query, so we have to export and re-import
|
||||
- I exported all documents with "type:5" (Homepage) and replaced the ID in the JSON:
|
||||
|
||||
```console
|
||||
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-fix-uuid.json -f 'type:5' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
|
||||
$ sed -i 's/"id":"-1"/"id":"3945ec23-2426-4fce-a2ea-48b38b91547f"/' /tmp/statistics-fix-uuid.json
|
||||
```
|
||||
|
||||
- (Oops, skipping the fields above was not necessary, since I'm importing back into DSpace 6 where those fields exist)
|
||||
- Then I re-imported:
|
||||
|
||||
```
|
||||
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-fix-uuid.json -k uid
|
||||
```
|
||||
|
||||
- This worked, but I still see new records coming in that have "id:-1" so I will need to repeat this during the migration.
|
||||
- I also notice many stats records that have erroneous cities:
|
||||
- `"city":"com.maxmind.geoip2.record.City [ {} ]"`
|
||||
- `"city":"com.maxmind.geoip2.record.City [ {\"geoname_id\":1002145,\"names\":{\"de\":\"George\",\"en\":\"George\",\"ru\":\"Джордж\",\"fr\":\"George\",\"ja\":\"ジョージ\"}} ]"`
|
||||
|
||||
## 2023-08-18
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
|
||||
## 2023-08-19
|
||||
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-08-21
|
||||
|
||||
- Experiment with the DSpace 7 REST API
|
||||
- I wrote a Python script to benchmark harvesting all 100,000+ items using the `/api/discover/search/objects` endpoint 100 items at a time
|
||||
- I was able to harvest the entire 106,000 items in fifty-two minutes, which seems slow, but that's about ten times faster than with the legacy REST API...
|
||||
- Still, I need to benchmark a bit more, as the item response doesn't include collection mappings or thumbnails
|
||||
- Reading the [API docs](https://github.com/DSpace/RestContract/blob/main/README.md#etags--conditional-headers) it seems that we should be able to use the standard `If-Modified-Since` header for some endpoints
|
||||
- I tried it on the `/api/discover/search/objects` and `/api/core/items` endpoints, but apparently those don't support this header because I don't see a `Last-Modified` header in the response
|
||||
- According to the docs, it means that these endpoints indeed don't support it...
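
The paged harvest described above can be sketched roughly as follows. This is not the benchmark script itself, only a minimal illustration: the `page` and `size` parameter names are assumptions about the search endpoint, the helper names are hypothetical, and the actual HTTP requests are left out so the example runs offline.

```python
from math import ceil
from urllib.parse import urlencode

# Base URL taken from the test server mentioned in these notes; the "page"
# and "size" parameter names are assumptions about the search endpoint.
SEARCH_URL = "https://dspace7test.ilri.org/server/api/discover/search/objects"


def page_urls(total_items, size=100, base=SEARCH_URL):
    """Yield one search URL per page needed to cover total_items."""
    for page in range(ceil(total_items / size)):
        params = {"dsoType": "item", "page": page, "size": size}
        yield f"{base}?{urlencode(params)}"


def items_per_second(total_items, minutes):
    """Average harvest throughput for a run of the given length."""
    return total_items / (minutes * 60)


urls = list(page_urls(106_000))
print(len(urls))  # number of requests needed at 100 items per page
print(urls[0])
print(round(items_per_second(106_000, 52), 1))  # rate for the 52-minute run
```

At 100 items per page the 106,000 items need 1,060 requests, so per-request latency dominates the total time.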
|
||||
|
||||
## 2023-08-22
|
||||
|
||||
- I was experimenting with the DSpace 7 REST API again
|
||||
- This time looking at the thumbnail responses in item endpoints
|
||||
- According to [the documentation](https://github.com/DSpace/RestContract/blob/main/items.md#main-thumbnail) the API will respond with HTTP 200 if there is a thumbnail, and HTTP 204 if there is no content
|
||||
- That means we need to make the request before we can even find out!
|
||||
- Tim on DSpace Slack pointed out the DSpace 7 REST API's [projections](https://github.com/DSpace/RestContract/blob/main/projections.md)
|
||||
- This means we can embed resources like thumbnail and owningCollection in the item (and other) requests, for example: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&embed=thumbnail,owningCollection
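
A small sketch of how the embed parameter might be used from a script. The layout of the embedded thumbnail (`_embedded` → `thumbnail` → `_links` → `content` → `href`) is an assumption about the HAL response and should be checked against a real reply; both function names are hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://dspace7test.ilri.org/server/api/discover/search/objects"


def search_url(embeds, base=BASE):
    """Build a search URL asking the server to embed the named linked resources."""
    params = {"dsoType": "item", "embed": ",".join(embeds)}
    # safe="," keeps the comma-separated embed list unescaped
    return f"{base}?{urlencode(params, safe=',')}"


def thumbnail_href(item):
    """Return the thumbnail content link from an item dict, or None.

    The key layout used here is an assumption, not verified against the API.
    """
    thumbnail = (item.get("_embedded") or {}).get("thumbnail")
    if not thumbnail:
        return None
    return thumbnail.get("_links", {}).get("content", {}).get("href")


print(search_url(["thumbnail", "owningCollection"]))
```

With the embed in place, a missing thumbnail can be detected from the item response itself rather than by making a second request and waiting for the HTTP 204.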
|
||||
|
||||
## 2023-08-23
|
||||
|
||||
- I benchmarked the DSpace 7 REST API with the new embeds and it took four hours and seventeen minutes to get all 106,000 items on DSpace 7 Test
|
||||
- So this is much slower than the results I saw earlier this week, but maybe slightly faster than DSpace 6?
|
||||
- Maria from Alliance contacted me to say they have agreed to use UN M.49 regions more strictly in TIP, so they want to replace our non-standard "Latin America" region with "Latin America and the Caribbean", "Caribbean" and "Americas" on all Alliance outputs
|
||||
- I exported their community on CGSpace and fixed the metadata in OpenRefine
|
||||
- I tried to run `dspace cleanup -v` on CGSpace, but got this error:
|
||||
|
||||
```
|
||||
Caused by: org.postgresql.util.PSQLException: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
|
||||
Detail: Key (uuid)=(61bff7da-c8e3-420f-841c-ec5e8238d716) is still referenced from table "bundle".
|
||||
```
|
||||
|
||||
- The solution, as always, is to delete those IDs manually in PostgreSQL:
|
||||
|
||||
```
|
||||
$ psql -d dspace -c "UPDATE bundle SET primary_bitstream_id=NULL WHERE primary_bitstream_id IN ('61bff7da-c8e3-420f-841c-ec5e8238d716');"
|
||||
UPDATE 1
|
||||
```
|
||||
|
||||
- I also tried to delete all users who haven't logged in since 2017 using the groomer script, but it crashes due to those users still having items or workflows or whatever:
|
||||
|
||||
```console
|
||||
$ dspace dsrun org.dspace.eperson.Groomer -a -b 08/23/2017 -d
|
||||
```
|
||||
|
||||
- I see that it is now [possible in DSpace 7 to delete such users](https://github.com/DSpace/DSpace/pull/2229) so we will have to wait
|
||||
|
||||
## 2023-08-24
|
||||
|
||||
- I spent some time trying to get themes to extend in DSpace 7
|
||||
- I finally got a basic ILRI theme working, but there is a bug that causes theme components to get duplicated
|
||||
|
||||
## 2023-08-25
|
||||
|
||||
- Meeting with Altmetric about the next phase of their integration with CGSpace
|
||||
- A bit of cleanup on CGSpace metadata
|
||||
- I fixed DOIs, licenses, dates, subjects, affiliations, titles, publishers, and types in 1,240 items
|
||||
|
||||
## 2023-08-26
|
||||
|
||||
- A few weeks ago we received a request from the Fruits and Vegetables Initiative saying that they've gotten approval to begin using the long name instead of the short one everywhere, apparently for SEO reasons
|
||||
- After communicating with PRMS and other teams working on systems using this metadata I finally updated them in CGSpace
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
- I fixed ~200 titles with new lines, excessive whitespace, and Unicode FFFD characters
|
||||
- There are many more with 00A0, 200B, etc, but those need more careful inspection
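
A rough sketch of how such titles could be flagged before deciding what to fix. The function names are hypothetical. Only newlines and runs of ASCII whitespace are collapsed automatically; U+00A0 and U+200B are reported but left alone, since they need manual inspection.

```python
import re
import unicodedata

# Code points mentioned above: replacement character, no-break space,
# zero width space.
SUSPICIOUS = {"\ufffd", "\u00a0", "\u200b"}


def suspicious_chars(title):
    """Return sorted 'U+XXXX NAME' strings for suspicious characters in title."""
    found = {ch for ch in title if ch in SUSPICIOUS or ch in "\r\n\t"}
    # Control characters have no Unicode name, so fall back to a fixed label
    return sorted(f"U+{ord(ch):04X} {unicodedata.name(ch, 'CONTROL')}" for ch in found)


def collapse_whitespace(title):
    """Collapse newlines and runs of ASCII whitespace into single spaces.

    An explicit character class is used because str.split() would also
    swallow U+00A0, which should be inspected by hand instead.
    """
    return re.sub(r"[ \t\r\n]+", " ", title).strip(" \t\r\n")


print(suspicious_chars("Maize\u00a0yield\u200b"))
print(collapse_whitespace("Maize  yield\n in   Kenya "))
```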
|
||||
|
||||
## 2023-08-28
|
||||
|
||||
- Day one of CGSpace partners meeting in Addis
|
||||
- Oh this is a game changer, I just realized that we can use Solr query syntax in the DSpace 7 REST API, so we can do this for example:
|
||||
|
||||
```
|
||||
https://dspace7test.ilri.org/server/api/discover/search/objects?query=lastModified%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
|
||||
```
|
||||
|
||||
- Which is this query: `lastModified:[2023-08-01T00:00:00Z TO *]`
|
||||
- The queries need to be URL encoded of course
|
||||
- Oh nice, and we can do the same for accession date:
|
||||
|
||||
```
|
||||
https://dspace7test.ilri.org/server/api/discover/search/objects?query=dc.date.accessioned_dt%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
|
||||
```
|
||||
|
||||
- That is this query: `dc.date.accessioned_dt:[2023-08-01T00:00:00Z TO *]`
|
||||
- We need to use the dt version of the accession date because that is the one that has a date type
|
||||
- This query gives 290 results, which should be the items submitted in August!
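
The URL encoding step can be done with the Python standard library. A minimal sketch, where `since_query_url` is a hypothetical helper name:

```python
from urllib.parse import quote

BASE = "https://dspace7test.ilri.org/server/api/discover/search/objects"


def since_query_url(field, since, base=BASE):
    """Build a search URL for records whose date field is on or after `since`."""
    solr_query = f"{field}:[{since} TO *]"
    # safe="" percent-encodes every reserved character, including ':' and '*'
    return f"{base}?query={quote(solr_query, safe='')}"


print(since_query_url("lastModified", "2023-08-01T00:00:00Z"))
print(since_query_url("dc.date.accessioned_dt", "2023-08-01T00:00:00Z"))
```

Passing `safe=""` matters here: by default `quote` leaves `/` unescaped, and the brackets, colons, and asterisk in a Solr range query all need encoding.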
|
||||
|
||||
## 2023-08-29
|
||||
|
||||
- Day two of CGSpace partners meeting in Addis
|
||||
|
||||
## 2023-08-30
|
||||
|
||||
- Day three of CGSpace partners meeting in Addis
|
||||
- I did a lot of work on the CGSpace Angular theme for DSpace 7
|
||||
- Many changes to Discovery filters and search results
|
||||
|
||||
## 2023-08-31
|
||||
|
||||
- Day four of CGSpace partners meeting in Addis
|
||||
- I removed the old Bioversity and CIAT subjects from Discovery facets on CGSpace
|
||||
- Maria and Leroy said they are no longer using them so we don't need to keep indexing and displaying them
|
||||
- I did a lot of work on the CGSpace Angular theme for DSpace 7
|
||||
- Now we have clickable keywords that go to Discovery instead of browse, as well as some new icons
|
||||
- We don't need to use the clunky browse links to get clickable links any more so I will disable those
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
243
content/posts/2023-09.md
Normal file
@@ -0,0 +1,243 @@
|
||||
---
|
||||
title: "September, 2023"
|
||||
date: 2023-09-02T17:29:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-09-02
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2023-09-03
|
||||
|
||||
- I figured out how to use Altmetric and Dimensions badges in the DSpace Angular frontend
|
||||
- It still feels hacky, but using [AfterViewInit](https://stackoverflow.com/questions/41936631/how-to-trigger-the-function-after-dom-markup-is-loaded-in-angular-style-applicat), and importing the Altmetric `embed.js` in the component works
|
||||
- The style on mobile also needs work...
|
||||
|
||||
## 2023-09-06
|
||||
|
||||
- Discussion with Marie about finalizing the output types list on GitHub
|
||||
- I did some review and cleanup in preparation for publishing the new list
|
||||
|
||||
## 2023-09-07
|
||||
|
||||
- Export CGSpace to start doing a review of the metadata
|
||||
- First I will start by extracting all items with DOIs, along with some fields I can compare against Crossref:
|
||||
|
||||
```console
|
||||
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv \
|
||||
| csvcut -c 'id,dc.title[en_US],dcterms.issued[en_US],dcterms.available[en_US],cg.issn[en_US],cg.isbn[en_US],cg.volume[en_US],cg.issue[en_US],cg.number[en_US],dcterms.extent[en_US],cg.identifier.doi[en_US],cg.reviewStatus[en_US],cg.isijournal[en_US],dcterms.license[en_US],dcterms.accessRights[en_US],dcterms.type[en_US],dc.identifier.uri[en_US]' \
|
||||
> /tmp/2023-09-07-cgspace-dois.csv
|
||||
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv | csvcut -c 'cg.identifier.doi[en_US]' | sed 1d > /tmp/2023-09-07-cgspace-dois.txt
|
||||
```
|
||||
|
||||
- Then I resolved the DOIs from Crossref:
|
||||
|
||||
```console
|
||||
$ ./ilri/crossref_doi_lookup.py -i /tmp/2023-09-07-cgspace-dois.txt -o /tmp/2023-09-07-cgspace-dois-results.csv -e a.orth@cgiar.org
|
||||
```
|
||||
|
||||
- A user emailed to ask about uploading a 180MB PDF to CGSpace
|
||||
- I used GhostScript to try reducing it using the `screen`, `ebook` and `prepress` presets:
|
||||
|
||||
```console
|
||||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-screen.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
|
||||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-ebook.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
|
||||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-prepress.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
|
||||
```
|
||||
|
||||
- The `prepress` one is 300DPI and looks visually identical to the original, so I proposed that we use that one
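
When comparing several presets it can be easier to generate the commands than to retype them. A sketch that builds the same argument list as the invocations above; the input file name is a placeholder and `gs_command` is a hypothetical helper:

```python
import shlex

PRESETS = ("screen", "ebook", "prepress")


def gs_command(source, preset):
    """Return the argument list for one GhostScript run with the given preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}", "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile=primer-{preset}.pdf", source,
    ]


for preset in PRESETS:
    # shlex.join quotes the file name so the line can be pasted into a shell;
    # the same list could instead be passed to subprocess.run()
    print(shlex.join(gs_command("Primer (digital).pdf", preset)))
```

Building the command as a list avoids the manual backslash escaping needed for file names with spaces and parentheses.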
|
||||
|
||||
## 2023-09-08
|
||||
|
||||
- I did a review of the metadata for our items with DOIs, comparing with data from Crossref
|
||||
- I spot checked a handful of issue / online dates and licenses, and saw that Crossref's dates are always more accurate than ours when they differ
|
||||
- I also filled in some missing volumes, issues, ISSNs, and extents
|
||||
- This results in 14,000 changes to existing items, which will take several days to import unfortunately
|
||||
- After eight hours the first file is only about 2/3 finished... sigh
|
||||
- Meet with Peter to discuss changes to the DSpace 7 test
|
||||
- Minor updates to submission forms and some new ideas for the home page and item page
|
||||
- I figured out how to use a themed home page component and add a cards UI to our CGSpace theme
|
||||
|
||||
## 2023-09-09
|
||||
|
||||
- I can't believe that almost 18 hours later the first CSV import with 5,000 changes is not done...
|
||||
- Run all system updates on CGSpace and reboot it, as it had been two months since the last time
|
||||
|
||||
## 2023-09-10
|
||||
|
||||
- Minor work on the DSpace 7 home page
|
||||
|
||||
## 2023-09-11
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-09-12
|
||||
|
||||
- Minor work on DSpace 7 home page
|
||||
- Minor work on CG Core types
|
||||
- I published a new HTML version of the updated IPtypes and archived the current version as v2.0.0 so we can still reference it
|
||||
|
||||
## 2023-09-13
|
||||
|
||||
- Stefano reminded me about the updated OAI MODS mappings on CGSpace so I re-applied them on DSpace Test and updated the OAI index so he could confirm
|
||||
- Now I'm ready to put it on CGSpace if he confirms
|
||||
- I created a basic theme for CIP on DSpace 7
|
||||
- While doing that I noticed that a bunch of CIP bitstreams didn't have the latest 500px thumbnails so I re-ran filter-media on a handful of their collections
|
||||
- I had two occurrences of an OOM kill of the Tomcat 9 java process on DSpace 7 test tonight
|
||||
- Once while doing a Discovery index, the other while doing filter media
|
||||
|
||||
## 2023-09-15
|
||||
|
||||
- Discuss issues with the Altmetric API with the Altmetric support team
|
||||
- Apparently we can use a different API, the [Explorer API](https://www.altmetric.com/explorer/documentation/api), since we already have access to the Explorer dashboard
|
||||
- I reduced the Solr heap size on DSpace 7 from 3GB to 2GB
|
||||
- Apparently I already did this from 4GB to 3GB a few months ago
|
||||
- The Solr admin interface was showing Solr taking ~1GB of RAM so I think this should be safe
|
||||
- Mark on DSpace Slack said he uses PM2's `--max-memory-restart` so the processes restart when they hit the limit
|
||||
- Also, he said he had to reduce `cache:serverSide:botCache:max` from 1000 to 500 to cache fewer SSR pages in memory
|
||||
- I decided to try deploying DSpace 7 Test on a Hetzner server with 64GB RAM, 6 CPUs, and 2x512GB NVMe SSD
|
||||
|
||||
## 2023-09-16
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
- Configure the privacy policy page on DSpace 7 using a themed component with the text from our DSpace 6 site
|
||||
- I realized that for all my custom Angular components I should be using `routerLink` instead of `href` when I am constructing links
|
||||
- The `routerLink` routes within the single page application and saves state, while the `href` reloads the page
|
||||
- Using the `routerLink` way is faster and results in less flashing and jumping in the page when navigating
|
||||
- See: https://stackoverflow.com/a/61588147
|
||||
|
||||
## 2023-09-17
|
||||
|
||||
- I added an About page to DSpace 7 Test using similar logic to the privacy page
|
||||
|
||||
## 2023-09-18
|
||||
|
||||
- I filed a GitHub issue for being unable to navigate dropdown lists using the keyboard on the dspace-angular submission form: https://github.com/DSpace/dspace-angular/issues/2500
|
||||
- I filed a GitHub issue for the search filters capitalizing metadata values: https://github.com/DSpace/dspace-angular/issues/2501
|
||||
|
||||
## 2023-09-19
|
||||
|
||||
- Complete migration of DSpace 7 Test from Linode to Hetzner
|
||||
- Export some years of Solr stats from CGSpace to import on the new DSpace 7 Test:
|
||||
|
||||
```console
|
||||
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020-2022.json -f 'time:[2020-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
|
||||
```
|
||||
|
||||
- Ben sent me an export of ILRI presentations from Slideshare and asked if we could see if any are missing on CGSpace
|
||||
- First I exported CGSpace and extracted the `cg.identifier.url` column so I could normalize all Slideshare URLs to use "https://www.slideshare.net" instead of localized variants (es.slideshare.net, fr.slideshare.net, etc) as well as non-https links and links with query params and slashes at the end
|
||||
- This was about 250 URLs
|
||||
- I extracted the URL field from both our list and the Slideshare list and then used [GNU `join` to print non-matched lines](https://unix.stackexchange.com/questions/274548/join-two-files-each-with-two-columns-including-non-matching-lines):
|
||||
|
||||
```console
|
||||
$ join -t, -v 2 -11 -21 -o auto /tmp/cgspace-ilri-slideshare-sorted-only-urls-sorted.csv /tmp/ilri-slideshare-sorted-sorted.csv | wc -l
|
||||
542
|
||||
```
|
||||
|
||||
- Important to note that you must use GNU `sort` on the files first, as I had tried sorting in vim and it didn't satisfy `join`
|
||||
- So it seems there are 542 Slideshare presentations we are missing
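
The normalization and comparison could also be sketched in Python, which sidesteps the sorting requirement of `join` by using sets. This is only an illustration of the approach described above, not the commands that were actually run; the function names are hypothetical.

```python
from urllib.parse import urlsplit

CANONICAL_HOST = "www.slideshare.net"


def normalize_slideshare(url):
    """Map localized, non-https, or decorated Slideshare URLs to one canonical form."""
    url = url.strip()
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host != "slideshare.net" and not host.endswith(".slideshare.net"):
        return url  # not a Slideshare link, leave untouched
    # Drop the query string and fragment, and any trailing slashes on the path
    return f"https://{CANONICAL_HOST}{parts.path.rstrip('/')}"


def missing_from(ours, theirs):
    """Return normalized URLs present in `theirs` but not in `ours`, sorted."""
    have = {normalize_slideshare(u) for u in ours}
    return sorted({normalize_slideshare(u) for u in theirs} - have)


print(normalize_slideshare("http://es.slideshare.net/ILRI/some-talk/?qid=1"))
```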
|
||||
|
||||
## 2023-09-20
|
||||
|
||||
- Regarding the incorrect city in Solr statistics, I see we have 1,600,000 of them
|
||||
- Before filing a GitHub issue, I want to check if they maybe come from an Atmire module, as I see them clustered around two particular CUA versions:
|
||||
|
||||
```json
|
||||
{
|
||||
"responseHeader": {
|
||||
"status": 0,
|
||||
"QTime": 2760,
|
||||
"params": {
|
||||
"q": "city:com.maxmind.geoip2.record.City*",
|
||||
"facet.field": "cua_version",
|
||||
"indent": "true",
|
||||
"rows": "0",
|
||||
"wt": "json",
|
||||
"facet": "true",
|
||||
"_": "1695192301927"
|
||||
}
|
||||
},
|
||||
"response": {
|
||||
"numFound": 1661863,
|
||||
"start": 0,
|
||||
"docs": []
|
||||
},
|
||||
"facet_counts": {
|
||||
"facet_queries": {},
|
||||
"facet_fields": {
|
||||
"cua_version": [
|
||||
"6.x-4.1.10-ilri-RC7",
|
||||
1112186,
|
||||
"6.x-4.1.10-ilri-RC5",
|
||||
451180,
|
||||
"6.x-4.1.10-ilri-RC9",
|
||||
0
|
||||
]
|
||||
},
|
||||
"facet_dates": {},
|
||||
"facet_ranges": {},
|
||||
"facet_intervals": {}
|
||||
}
|
||||
}
|
||||
```
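
Solr returns each facet field as a flat list that alternates values and counts, which is awkward to read. A short sketch that pairs them up, using the numbers from the response above (`facet_counts` is a hypothetical helper name):

```python
def facet_counts(flat):
    """Pair up Solr's flat [value, count, value, count, ...] facet list."""
    return dict(zip(flat[::2], flat[1::2]))


# Values copied from the response above
cua_version = [
    "6.x-4.1.10-ilri-RC7", 1112186,
    "6.x-4.1.10-ilri-RC5", 451180,
    "6.x-4.1.10-ilri-RC9", 0,
]
counts = facet_counts(cua_version)
num_found = 1661863
with_version = sum(counts.values())
print(counts)
print(num_found - with_version)  # matches not counted under any listed version
```

The listed versions do not add up to `numFound`, so some of the affected records are presumably not tagged with any of these CUA versions.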
|
||||
|
||||
- I migrated AReS from Linode to Hetzner
|
||||
- I asked on Slack and someone told me that we need to edit `src/app/menu.resolver.ts` to add new drop down menus to the top navbar
|
||||
- It works, though it is unfortunate that we can't do it in a theme
|
||||
|
||||
## 2023-09-21
|
||||
|
||||
- More minor work on DSpace 7 home page and menus
|
||||
- Meeting to discuss types and DSpace 7 migration plans
|
||||
- Create a DSpace 7 theme for IITA
|
||||
|
||||
## 2023-09-22
|
||||
|
||||
- Create a DSpace 7 theme for IWMI
|
||||
- I had some issues with pm2 on the new DSpace 7 Test
|
||||
- It seems to be due to mixing systemd starting versus manually starting / stopping...
|
||||
- After reading the discussion in [this pm2 issue](https://github.com/Unitech/pm2/issues/2914) I realize that we probably need to use `--no-daemon` to have systemd fully manage the processes without pm2 trying to save state
|
||||
|
||||
## 2023-09-23
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-09-25
|
||||
|
||||
- CGSpace metadata and community / collection cleanup
|
||||
- Review some patches on DSpace Angular
|
||||
- Create a basic Alliance theme for DSpace 7
|
||||
|
||||
## 2023-09-27
|
||||
|
||||
- I realized that we can get controlled vocabularies from DSpace 7's REST API, for both value-pairs and hierarchical controlled vocabularies, ie:
|
||||
|
||||
https://dspace7test.ilri.org/server/api/submission/vocabularies/common_iso_languages/entries
|
||||
|
||||
## 2023-09-29
|
||||
|
||||
- Meeting with Aditi and others to discuss plan for using CGSpace to do a systematic review of CGIAR research on climate change
|
||||
- I cleaned up metadata for a hundred or so items, and realized we will need to do more to make sure abstracts and open access status are correct since there will be a laser focus on the metadata
|
||||
|
||||
## 2023-09-30
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Still working on checking Unpaywall for access rights and licenses for our DOIs
|
||||
- Regarding Unpaywall's "evidence" metadata about whether an item is open access or not, after looking at dozens of items manually:
|
||||
- evidence: "oa journal (via doaj)" <---- yes
|
||||
- evidence: "open (via free article)" <---- hmmm, not always correct
|
||||
- evidence: "open (via page says license)" <--- noooo, can't rely on that
|
||||
- evidence: "open (via page says Open Access)" <---- yes...?
|
||||
- evidence: "open (via free pdf)" <---- hmmm, not always correct
|
||||
- evidence: "oa journal (via publisher name)" <---- noooo
|
||||
- I updated access status for about four hundred more items based on this, and licenses for a dozen or so
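
The verdicts above could be captured in a small lookup table so that a script only auto-applies the trustworthy ones. The verdict labels (`yes`, `probably`, `check`, `no`) are illustrative shorthand for the arrows in the list, not Unpaywall terminology.

```python
# Verdicts transcribed from the manual review above; the labels are
# illustrative shorthand, not part of Unpaywall's data.
EVIDENCE_TRUST = {
    "oa journal (via doaj)": "yes",
    "open (via page says Open Access)": "probably",
    "open (via free article)": "check",
    "open (via free pdf)": "check",
    "open (via page says license)": "no",
    "oa journal (via publisher name)": "no",
}


def trust(evidence):
    """Return the review verdict for an evidence string, or 'unknown' if unseen."""
    return EVIDENCE_TRUST.get(evidence, "unknown")


print(trust("oa journal (via doaj)"))
print(trust("open (via free pdf)"))
```

Anything that comes back as `check` or `unknown` would still need a manual look before changing an item's access status.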
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
150
content/posts/2023-10.md
Normal file
@@ -0,0 +1,150 @@
|
||||
---
|
||||
title: "October, 2023"
|
||||
date: 2023-10-02T09:05:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-10-02
|
||||
|
||||
- Export CGSpace to check DOIs against Crossref
|
||||
- I found that [Crossref's metadata is in the public domain under the CC0 license](https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/)
|
||||
- One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
|
||||
- We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
|
||||
|
||||
<!--more-->
|
||||
|
||||
- This GREL extracts the _text_ content of the `<jats:p>` tags (ie, no other JATS XML markup tags like `<jats:i>`, `<jats:sub>`, etc):
|
||||
|
||||
```console
|
||||
forEach(value.parseXml().select("jats|p"),i,i.xmlText()).join("")
|
||||
```
|
||||
|
||||
- Note that we need to use `select("jats|p")` instead of `select("jats:p")` for OpenRefine's parseXml, and we need to `join()` on the end
|
||||
- I updated metadata for about 3,000 items using Crossref metadata
|
||||
- I stripped trailing periods for titles where they were missing on the Crossref titles
|
||||
- I copied abstracts for about 600 items that were missing them, for items that were Creative Commons
|
||||
- I updated publishers for a few thousand more where ours and Crossref disagreed, checking a handful manually first
|
||||
- I also added subjects to the `crossref_doi_lookup.py` script to see if they will be useful for us
|
||||
- When checking with csv-metadata-quality I can validate those subjects against AGROVOC and add them if they are valid
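
For reference, the GREL expression for extracting abstract text has a rough Python equivalent. This sketch assumes the abstract arrives as a fragment with `jats:` prefixes but no namespace declaration, so it wraps the fragment in a root element that declares one; the namespace URI is an assumption and is only used for parsing. Fragments containing unescaped `&` would still fail to parse.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; it only needs to match between the wrapper element
# and the lookup below.
JATS_NS = "http://www.ncbi.nlm.nih.gov/JATS1"


def jats_paragraph_text(abstract):
    """Return the joined text of all <jats:p> elements, dropping inline tags."""
    root = ET.fromstring(f'<root xmlns:jats="{JATS_NS}">{abstract}</root>')
    paragraphs = root.iter(f"{{{JATS_NS}}}p")
    # itertext() walks nested elements, so <jats:i>, <jats:sub>, etc. vanish
    return "".join("".join(p.itertext()) for p in paragraphs)


sample = "<jats:p>Effects of CO<jats:sub>2</jats:sub> on <jats:i>Zea mays</jats:i>.</jats:p>"
print(jats_paragraph_text(sample))
```

As with the GREL version, text outside `<jats:p>` (for example a `<jats:title>`) is ignored and paragraphs are joined without a separator.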
|
||||
|
||||
## 2023-10-03
|
||||
|
||||
- I added the item type to the collection subscription email on DSpace 6
|
||||
- It's done differently on DSpace 7 so I'll have to see how to do it there...
|
||||
- Test a patch that fixes a bug with item versioning disabled in DSpace 7
|
||||
- I hadn't realized that DSpace 7 defaulted to versioning being enabled, whereas we never used this in DSpace 6 (yet)
|
||||
- Submit [an issue regarding duplicate Discovery sort fields](https://github.com/DSpace/DSpace/issues/9104) in DSpace 7
|
||||
|
||||
## 2023-10-05
|
||||
|
||||
- Some discussion this week about issue and online dates for journal articles, with regards to PRMS
|
||||
- I looked more closely at the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and realized (again) that their "issue" date is not the same as our issue date—they take the earlier of the print and online dates!
|
||||
- Also, *very many* items have no print date at all, perhaps due to delays, errors, or simply because the journal is "online only"!
|
||||
- I suggested again that PRMS should consider both, and take the earlier of the two, then check whether that date is in the current reporting period
|
||||
- I managed to find 80 items with print publishing dates from 2023 and updated those from Crossref, but for the rest we will have to think about how we handle them
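
The "earlier of the two dates" rule suggested above can be sketched as follows. It assumes Crossref-style date parts (lists such as `[2023, 1]` that may omit the month or day), pads missing parts with 1, and treats the reporting period as a calendar year. All three function names are hypothetical.

```python
from datetime import date


def from_date_parts(parts):
    """Turn date parts like [2023, 1] into a date, padding missing parts with 1."""
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def earliest_publication(print_parts=None, online_parts=None):
    """Return the earlier of the print and online dates, or None if both are missing."""
    dates = [from_date_parts(p) for p in (print_parts, online_parts) if p]
    return min(dates) if dates else None


def in_reporting_period(when, year=2023):
    """True if the date falls in the given calendar year (assumed reporting period)."""
    return when is not None and when.year == year


first = earliest_publication(print_parts=[2023, 3], online_parts=[2022, 12, 20])
print(first, in_reporting_period(first))
```

Padding partial dates with 1 biases them toward the start of the month or year, which is a simplification worth keeping in mind near period boundaries.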
|
||||
|
||||
## 2023-10-06
|
||||
|
||||
- More discussion about dates after looking closely at them yesterday and today
|
||||
- Crossref doesn't always have both issued and online dates—sometimes they have one, sometimes the other, and sometimes both, so we cannot rely on them 100% for that.
|
||||
- In some cases, the item is available online for months (or even a year!), but has not been included in an issue yet, and thus has no "issue" date, for example:
|
||||
- https://doi.org/10.1002/csc2.20914 <--- published online January 2023!
|
||||
- https://doi.org/10.1111/mcn.13401 <--- published online July 2022!
|
||||
- Even journals make mistakes: this journal article was "issued" in 2022, but online in 2023! This is not Crossref's fault, but the journal's!
|
||||
- https://doi.org/10.1186/s40066-022-00400-6
|
||||
- I found a bunch more strange cases regarding dates and recommended to PRMS team that they use the earlier of the issued and online dates
|
||||
- Meet with Aditi to start discussing the scope of knowledge products we can get for the CGIAR climate change synthesis
|
||||
|
||||
## 2023-10-07
|
||||
|
||||
- I spent a few hours (!) debugging an issue in Python when downloading PDFs
|
||||
- I think it ended up being due to `requests_cache`!!! Grrrr
|
||||
- On a positive note I've greatly refactored my script for discovering and downloading PDFs from Unpaywall
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-10-08
|
||||
|
||||
- Starting to see some stuck locks on CGSpace this morning
|
||||
- I will give notice and restart CGSpace
|
||||
- Work on Python script to harvest DSpace REST API and save to CSV
|
||||
|
||||
## 2023-10-11
|
||||
|
||||
- File an issue on the DSpace issue tracker regarding the MaxMind JSON objects in our Solr statistics: https://github.com/DSpace/DSpace/issues/9118
|
||||
|
||||
## 2023-10-12
|
||||
|
||||
- Discuss MODS issues in CGSpace's OAI-PMH with Stefano and Valentina
|
||||
- AGRIS can currently only support MODS 3.7 so they need us to roll our 3.8 work from 2023-06 back down, which requires some minor changes to the crosswalk
|
||||
|
||||
## 2023-10-13
|
||||
|
||||
- I did some more minor work to get the MODS 3.7 changes ready for AGRIS on DSpace Test
|
||||
|
||||
## 2023-10-14
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
- I deployed the AGRIS changes for OAI-PMH on CGSpace
|
||||
|
||||
## 2023-10-16
|
||||
|
||||
- Fix some typos in ILRI subjects on CGSpace
|
||||
- These were affecting the taxonomy on ilri.org
|
||||
- I exported CGSpace and did some validation and cleanup on ILRI subjects, moving some to AGROVOC subjects
|
||||
- Port the MODS 3.7 crosswalk from DSpace 6 to DSpace 7
|
||||
- It works fine, we only need to take note that the OAI-PMH endpoint is now relative to the `/server` path instead of a dedicated OAI path
|
||||
|
||||
## 2023-10-17
|
||||
|
||||
- Export CGSpace to do some cleanups all over on invalid metadata values
|
||||
- I found many metadata values in the wrong field, wrong format, etc
|
||||
- This ended up being cleanups for 694 items
|
||||
|
||||
## 2023-10-20
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- I also did a run of looking up all Initiative outputs with DOIs against Crossref to check for missing dates, publishers, etc
|
||||
- I found issued dates for a few, and online dates for over 100
|
||||
- I also fixed some incorrect licenses, access status, and abstracts
|
||||
|
||||
## 2023-10-23
|
||||
|
||||
- Export a list of Internal Documents for Peter to review to see if we can re-classify some
|
||||
- Peter sent changes for 740 items so I applied them on CGSpace
|
||||
- Testing the changes for OpenRXV DSpace 7 compatibility
|
||||
|
||||
## 2023-10-24
|
||||
|
||||
- Sync DSpace 7 Test with a fresh CGSpace snapshot
|
||||
- Meeting with FARA to discuss DSpace training and support
|
||||
- Meeting with IFPRI about migrating to CGSpace
|
||||
|
||||
## 2023-10-25
|
||||
|
||||
- Maria was asking about an error deleting an item in the Alliance community
|
||||
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:..."
|
||||
- According to my notes this error happened a few times in the past and is some kind of corner case regarding permissions
|
||||
- I deleted the item for her
|
||||
- I deleted a handful of old CRP groups on CGSpace
|
||||
|
||||
## 2023-10-27
|
||||
|
||||
- Peter sent me a list of journal articles from Altmetric that have an ILRI affiliation, but no Handle
|
||||
- I used my `crossref_doi_lookup.py` script to fetch the metadata for them using their DOIs, then did a bunch of cleanup in OpenRefine
|
||||
- Test some LDAP patches for DSpace 7
|
||||
|
||||
## 2023-10-30
|
||||
|
||||
- Some work on metadata for Aditi's review
|
||||
- I found more preprints grrrr
|
||||
|
||||
## 2023-10-31
|
||||
|
||||
- Peter got back to me with the cleanups on ILRI journal articles from Altmetric that we didn't have on CGSpace
|
||||
- I did another duplicate check and found four more duplicates that had been uploaded yesterday
|
||||
- Then I did a quick sanity check and uploaded the remaining 19 items to CGSpace
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
215
content/posts/2023-11.md
Normal file
@@ -0,0 +1,215 @@
|
||||
---
|
||||
title: "November, 2023"
|
||||
date: 2023-11-02T12:59:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-11-01
|
||||
|
||||
- Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
|
||||
- I improved the filtering and wrote some Python using pandas to merge my sources more reliably
|
||||
|
||||
## 2023-11-02
|
||||
|
||||
- Export CGSpace to check missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
<!--more-->
|
||||
|
||||
- IFPRI contacted us about importing their Slideshare presentations to CGSpace
|
||||
- There are ~1,700 of them and they date back to as early as 2008
|
||||
- I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test
|
||||
|
||||
## 2023-11-03
|
||||
|
||||
- A little bit of work on the CGIAR Climate Change Synthesis
|
||||
- Discuss some CGSpace migration plans with Leigh from IFPRI
|
||||
- For their Slideshare content we agreed:
|
||||
- Exclude private
|
||||
- Exclude deleted
|
||||
- Exclude non presentation types
|
||||
- Exclude duplicates within the collection for now until we can sort them out
|
||||
- That leaves about 1,500 items out of the 1,700
|
||||
- I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those
|
||||
|
||||
## 2023-11-04
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- I ran through the list of potential duplicates on the IFPRI Slideshare presentations
|
||||
|
||||
## 2023-11-05
|
||||
|
||||
- Work with Salem to migrate AReS to the new version
|
||||
|
||||
## 2023-11-07
|
||||
|
||||
- DSpace 7 Test went down and there is very high load on the server
|
||||
- I saw very high load from Java but didn't have time to check exactly what was wrong so I just rebooted the host
|
||||
- A few hours after restarting, the system went down again, with very high load from Java again
|
||||
- I see lots of messages like this in the Tomcat log:
|
||||
|
||||
```
|
||||
tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
|
||||
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
|
||||
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
|
||||
```
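
To see how little each full collection recovers, the `Pause Full` lines can be parsed. This sketch only handles lines shaped like the one above; the regular expression is an assumption based on that single sample, and `parse_full_gc` is a hypothetical helper.

```python
import re

# Pattern derived from the sample log line above
PAUSE_FULL = re.compile(
    r"Pause Full \((?P<cause>[^)]*)\) "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\) (?P<ms>[\d.]+)ms"
)


def parse_full_gc(line):
    """Extract heap sizes (MB) and pause time from a 'Pause Full' GC log line."""
    match = PAUSE_FULL.search(line)
    if match is None:
        return None
    before, after, total = (int(match[k]) for k in ("before", "after", "total"))
    return {
        "cause": match["cause"],
        "freed_mb": before - after,
        "occupancy_after": after / total,
        "pause_ms": float(match["ms"]),
    }


line = "tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
print(parse_full_gc(line))
```

For the sample line, a pause of roughly 0.7 seconds frees only 5 MB and leaves the heap more than 99% full, which is consistent with the heap space errors that follow.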
|
||||
|
||||
- I see some messages in `dspace.log` about heap space:
|
||||
|
||||
```
|
||||
Caused by: java.lang.OutOfMemoryError: Java heap space
|
||||
```
|
||||
|
||||
- I will increase Tomcat's heap from 4096m to 5120m
|
||||
- A few hours later it happened again, so I increased the heap from 5120m to 6144m
|
||||
- Not sure what's going on today...
|
||||
- I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
|
||||
|
||||
```console
|
||||
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
|
||||
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
|
||||
$ dspace index-discovery -r 10947/2516
|
||||
$ dspace index-discovery -r 10947/2515
|
||||
$ dspace index-discovery -r 10568/83389
|
||||
$ dspace index-discovery
|
||||
```
|
||||
|
||||
- I think this is the minimum we can do to avoid a full Discovery reindex which is very expensive
|
||||
- I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode as I had done before in [September, 2023]({{< relref "2023-09.md" >}})
|
||||
|
||||
## 2023-11-08
|
||||
|
||||
- DSpace 7 Test has very high load again and I see more Java heap space errors in the log
|
||||
|
||||
```console
|
||||
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07
|
||||
35
|
||||
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
|
||||
7
|
||||
```
|
||||
|
||||
- I don't know what is happening... I will increase the heap size from 6144m to 7168m again...
|
||||
- I did some work on the value mappings in AReS
|
||||
- I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine
|
||||
- Importing creates duplicate records, so I deleted and re-created the index in Elasticsearch first
|
||||
- Then I started a new harvest on AReS to make sure the mappings are applied
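
- A rough sketch of the JSON to CSV step for OpenRefine, assuming the export is a JSON array of flat objects with `find` and `replace` keys (I have not verified the exact export format here, so treat the key names and the function name as hypothetical):

```python
import csv
import io
import json


def mappings_json_to_csv(json_text, fields=("find", "replace")):
    """Convert a JSON array of flat objects to CSV text with a header row.

    Assumption: the field names are placeholders for whatever keys the
    real value mapping export uses.
    """
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(fields),
        extrasaction="ignore",  # silently drop keys we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells rather than raising an error
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()
```

- The `csv` module takes care of quoting values that contain commas, which matters for values like institution names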
|
||||
|
||||
## 2023-11-09
|
||||
|
||||
- Ryan asked me for help uploading a large PDF to CGSpace
|
||||
- I tried my usual GhostScript prepress invocation and found that the size decreased significantly, but some minor artifacts appeared in the images
|
||||
- Interestingly, the [GhostScript docs](https://ghostscript.com/docs/9.54.0/VectorDevices.htm) mention that `prepress` doesn't give the best results:
|
||||
|
||||
> Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).
|
||||
|
||||
- Also, I found [a question on StackOverflow discussing some further techniques for PDFs with images](https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality):
|
||||
|
||||
```console
|
||||
$ gs -sOutputFile=137166-default-dct.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -c "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams" -f 137166.pdf
|
||||
```
|
||||
|
||||
- This looks much better, and is still much smaller than the original
|
||||
- Also, I used `pdfimages` to extract all the images from the original and the one above and found:
|
||||
|
||||
```console
|
||||
$ du -sh images-*
|
||||
886M images-default-dct
|
||||
1012M images-original
|
||||
```
|
||||
|
||||
- And from [WeCompress's analysis](https://www.wecompress.com/en/analyze) I see that the images are 85% of the size of the PDF
|
||||
|
||||
## 2023-11-10
|
||||
|
||||
- I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace
|
||||
|
||||
## 2023-11-11
|
||||
|
||||
- Salem fixed a bug on OpenRXV that was splitting country values by "," before matching them with ISO countries
|
||||
- I exported CGSpace to check for missing Initiative collection mappings
|
||||
- Start a fresh harvest on AReS
|
||||
|
||||
## 2023-11-16
|
||||
|
||||
- Discuss mapping ICARDA outputs from Initiatives to ICARDA collections on CGSpace
|
||||
- I added MEL's CGSpace user to the administrator group of a handful of collections
|
||||
- I also did a batch mapping of 274 existing Initiative outputs from ICARDA to the relevant collections
|
||||
|
||||
## 2023-11-18
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-11-22
|
||||
|
||||
- I was checking out the [DSpace 7 statistics](https://github.com/DSpace/RestContract/blob/main/statistics-reports.md) again and found that we have total visits and total downloads for each DSpace object, for example [this item](https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748):
|
||||
- TotalVisits: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits
|
||||
- TotalDownloads: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads
|
||||
- And the numbers match those in my dspace-statistics-api *exactly*!
|
||||
- This can be useful to get an individual DSpace object's stats, but there is no way to iterate over all objects like all items...
|
||||
- We can look at using this to draw stats on the community, collection, and item pages
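
- A small sketch of building those usage report URLs for any object UUID, following the pattern of the two links above (the helper name and `BASE_URL` constant are my own, and I am only assuming the `<uuid>_<report>` pattern holds for other objects):

```python
from urllib.parse import quote

# Assumption: same REST base as the DSpace 7 Test links above
BASE_URL = "https://dspace7test.ilri.org/server/api"


def usage_report_url(uuid, report, base_url=BASE_URL):
    """Build the usage report URL for one DSpace object and report type.

    `report` is a report name such as "TotalVisits" or "TotalDownloads".
    """
    return f"{base_url}/statistics/usagereports/{quote(uuid)}_{report}"
```

- Since there is no way to iterate over all objects, the UUIDs would have to come from somewhere else, for example the database or a search query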
|
||||
|
||||
## 2023-11-23
|
||||
|
||||
- Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
|
||||
count
|
||||
───────
|
||||
47818
|
||||
(1 row)
|
||||
```
|
||||
|
||||
- It's been some time since I looked at our Solr statistics to find new bots
|
||||
- I found a few new ones that I [submitted to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/60) and added to our local bot list:
|
||||
- GuzzleHttp/7
|
||||
- Owler@ows.eu/1
|
||||
- newspaperjs
|
||||
- I ran my old `check-spider-hits.sh` script with a list of bots from our local overrides to purge hits from Solr:
|
||||
|
||||
```console
|
||||
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
|
||||
Purging 30 hits from ubermetrics in statistics
|
||||
Purging 59 hits from curb in statistics
|
||||
Purging 36 hits from bitdiscovery in statistics
|
||||
Purging 87 hits from omgili in statistics
|
||||
Purging 47 hits from Vizzit in statistics
|
||||
Purging 109 hits from Java\/17-ea in statistics
|
||||
Purging 40 hits from AdobeUxTechC4-Async in statistics
|
||||
Purging 21 hits from ZaloPC-win32-24v473 in statistics
|
||||
Purging 21 hits from nbertaupete95 in statistics
|
||||
Purging 52 hits from Scoop\.it in statistics
|
||||
Purging 16 hits from WebAPIClient in statistics
|
||||
Purging 241 hits from RStudio in statistics
|
||||
Purging 1255 hits from ^MEL in statistics
|
||||
Purging 47850 hits from GuzzleHttp in statistics
|
||||
Purging 8714 hits from Owler in statistics
|
||||
Purging 1083 hits from newspaperjs in statistics
|
||||
Purging 369 hits from ^Chrome$ in statistics
|
||||
Purging 1474 hits from curl in statistics
|
||||
|
||||
Total number of bot hits purged: 61504
|
||||
```
|
||||
|
||||
- I also noticed 35,000 requests over the past few years from lowercase user agents, which is [definitely weird](https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case), for example:
|
||||
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
|
||||
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
|
||||
- I'm gonna add those to our overrides and purge them:
|
||||
|
||||
```console
|
||||
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
|
||||
Purging 35816 hits from ^mozilla in statistics
|
||||
|
||||
Total number of bot hits purged: 35816
|
||||
```
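
- The all-lowercase check is easy to reproduce offline, for example when scanning a list of user agents exported from Solr; a minimal sketch (the function names are my own, and the pattern mirrors the `^mozilla` override above):

```python
import re

# Real browsers send "Mozilla/5.0 ...", so a lowercase prefix is suspicious
LOWERCASE_MOZILLA = re.compile(r"^mozilla")


def is_lowercase_agent(user_agent):
    """True if the agent has letters and none of them are uppercase."""
    return user_agent.islower()


def matches_override(user_agent):
    """True if the agent would match the case-sensitive ^mozilla pattern."""
    return LOWERCASE_MOZILLA.search(user_agent) is not None
```

- Note that the match has to be case sensitive, otherwise `^mozilla` would also catch every normal browser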
|
||||
|
||||
## 2023-11-30
|
||||
|
||||
- Minor updates to our OAI MODS crosswalk
|
||||
- Stefano found a minor markup issue with our alternative titles (`<titleInfo>` tag)
|
||||
- Very high load on CGSpace since after lunch
|
||||
- I killed some locks that had been stuck for a few hours
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
271
content/posts/2023-12.md
Normal file
@@ -0,0 +1,271 @@
|
||||
---
|
||||
title: "December, 2023"
|
||||
date: 2023-12-01T08:48:36+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2023-12-01
|
||||
|
||||
- There is still high load on CGSpace and I don't know why
|
||||
- I don't see a high number of sessions compared to previous days in the last few weeks
|
||||
|
||||
<!-- more -->
|
||||
|
||||
```console
|
||||
$ for file in dspace.log.2023-11-[23]*; do echo "$file"; grep -a -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
|
||||
dspace.log.2023-11-20
|
||||
22865
|
||||
dspace.log.2023-11-21
|
||||
20296
|
||||
dspace.log.2023-11-22
|
||||
19688
|
||||
dspace.log.2023-11-23
|
||||
17906
|
||||
dspace.log.2023-11-24
|
||||
18453
|
||||
dspace.log.2023-11-25
|
||||
17513
|
||||
dspace.log.2023-11-26
|
||||
19037
|
||||
dspace.log.2023-11-27
|
||||
21103
|
||||
dspace.log.2023-11-28
|
||||
23023
|
||||
dspace.log.2023-11-29
|
||||
23545
|
||||
dspace.log.2023-11-30
|
||||
21298
|
||||
```
|
||||
|
||||
- Even the number of unique IPs is not very high compared to the last week or so:
|
||||
|
||||
```console
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq | wc -l
|
||||
17023
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.2.gz | sort | uniq | wc -l
|
||||
17294
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.3.gz | sort | uniq | wc -l
|
||||
22057
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.4.gz | sort | uniq | wc -l
|
||||
32956
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.5.gz | sort | uniq | wc -l
|
||||
11415
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.6.gz | sort | uniq | wc -l
|
||||
15444
|
||||
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.7.gz | sort | uniq | wc -l
|
||||
12648
|
||||
```
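
- The rotated logs ending in `.gz` are compressed, so for a count that reads plain and compressed files the same way, a sketch along these lines might be safer (the function names are my own, and I am assuming the client IP is the first field as in the `awk` commands above):

```python
import gzip


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def unique_ips(paths):
    """Collect the first whitespace-separated field of every non-empty line."""
    ips = set()
    for path in paths:
        with open_log(path) as handle:
            for line in handle:
                fields = line.split()
                if fields:
                    ips.add(fields[0])
    return ips
```

- Calling `len(unique_ips([...]))` with the four nginx logs for one day would give a number comparable to the `sort | uniq | wc -l` pipeline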
|
||||
|
||||
- It doesn't make any sense so I think I'm going to restart the server...
|
||||
- After restarting the server the load went down to normal levels... who knows...
|
||||
- I started trying to see how I'm going to generate the fake statistics for the Alliance bitstream that was replaced
|
||||
- I exported all the statistics for the owningItem now:
|
||||
|
||||
```console
|
||||
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-export.json -f 'owningItem:b5862bfa-9799-4167-b1cf-76f0f4ea1e18' -k uid
|
||||
```
|
||||
|
||||
- Importing them into DSpace Test didn't show the statistics in the Atmire module, but I see them in Solr...
|
||||
|
||||
## 2023-12-02
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-12-04
|
||||
|
||||
- Send a message to Altmetric support because the item IWMI highlighted last month still doesn't show the attention score for the Handle after I tweeted it several times weeks ago
|
||||
- Spent some time writing a Python script to fix the literal MaxMind City JSON objects in our Solr statistics
|
||||
- There are about 1.6 million of these, so I exported them using solr-import-export-json with the query `city:com*` but ended up finding many that have missing bundles, container bitstreams, etc:
|
||||
|
||||
```
|
||||
city:com* AND -bundleName:[* TO *] AND -containerBitstream:[* TO *] AND -file_id:[* TO *] AND -owningItem:[* TO *] AND -version_id:[* TO *]
|
||||
```
|
||||
|
||||
- (Note the negation to find fields that are missing)
|
||||
- I don't know what I want to do with these yet
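
- The negated range clauses are repetitive, so a tiny helper could build such a query from a list of field names (the function name is my own, the field names are the ones from the query above):

```python
def missing_fields_query(base, fields):
    """Build a Solr query matching `base` where none of `fields` has a value.

    Each field becomes a negated open range, -field:[* TO *], which only
    matches documents that have no value at all for that field.
    """
    clauses = [base] + [f"-{field}:[* TO *]" for field in fields]
    return " AND ".join(clauses)


# Field names taken from the query in the notes above
FIELDS = ["bundleName", "containerBitstream", "file_id", "owningItem", "version_id"]
QUERY = missing_fields_query("city:com*", FIELDS)
```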
|
||||
|
||||
## 2023-12-05
|
||||
|
||||
- I finished the `fix_maxmind_stats.py` script and fixed 1.6 million records and imported them on CGSpace after testing on DSpace 7 Test
|
||||
- Altmetric said there was a glitch regarding the Handle and DOI linking and they successfully re-scraped the item page and linked them
|
||||
- They sent me a list of current production IPs and I notice that some of them are in our nginx bot network list:
|
||||
|
||||
```console
|
||||
$ for network in $(csvcut -c network /tmp/ips.csv | sed 1d | sort -u); do grepcidr $network ~/src/git/rmg-ansible-public/roles/dspace/files/nginx/bot-networks.conf; done
|
||||
108.128.0.0/13 'bot';
|
||||
46.137.0.0/16 'bot';
|
||||
52.208.0.0/13 'bot';
|
||||
52.48.0.0/13 'bot';
|
||||
54.194.0.0/15 'bot';
|
||||
54.216.0.0/14 'bot';
|
||||
54.220.0.0/15 'bot';
|
||||
54.228.0.0/15 'bot';
|
||||
63.32.242.35/32 'bot';
|
||||
63.32.0.0/14 'bot';
|
||||
99.80.0.0/15 'bot'
|
||||
```
|
||||
|
||||
- I will remove those for now so that Altmetric doesn't have any unexpected issues harvesting
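
- A similar overlap check can be sketched with Python's standard `ipaddress` module; this is only an approximation of what `grepcidr` does, and the function name is my own:

```python
import ipaddress


def overlapping_networks(candidates, blocked):
    """Return (candidate, blocked) pairs of CIDR strings that overlap.

    `candidates` would be the networks to check (for example a partner's
    production networks) and `blocked` the entries of the bot network list.
    """
    blocked_nets = [ipaddress.ip_network(net) for net in blocked]
    hits = []
    for candidate in candidates:
        # strict=False tolerates entries with host bits set
        candidate_net = ipaddress.ip_network(candidate, strict=False)
        for blocked_net in blocked_nets:
            # Only compare IPv4 with IPv4 and IPv6 with IPv6
            if candidate_net.version != blocked_net.version:
                continue
            if candidate_net.overlaps(blocked_net):
                hits.append((str(candidate_net), str(blocked_net)))
    return hits
```

- This reports both a bot network that contains one of the candidate networks and a bot network that falls inside one, since `overlaps()` is symmetric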
|
||||
|
||||
## 2023-12-08
|
||||
|
||||
- Finalized the script to generate Solr statistics for Alliance researcher Mirjam
|
||||
- The script is `ilri/generate_solr_statistics.py`
|
||||
- I generated ~3,200 statistics based on her records of the download statistics of [that item](https://hdl.handle.net/10568/131997) and imported them on CGSpace
|
||||
- Did some work on the DSpace 7 submission form
|
||||
- Peter asked for lists of affiliations, investors, and publishers to do some cleanups
|
||||
- I generated a list from a CSV export instead of doing it based on a SQL dump...
|
||||
|
||||
```console
|
||||
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/initiatives.csv \
|
||||
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
|
||||
| sort | uniq -c | sort -hr \
|
||||
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
|
||||
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
|
||||
> /tmp/2023-12-08-initiatives-affiliations.csv
|
||||
```
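
- The same counting could be sketched in Python, which avoids relying on `sed` to strip the CSV quoting; this assumes multiple values are separated by `||` as in the export above (the function name is my own):

```python
import csv
import io
from collections import Counter


def count_values(csv_text, column, separator="||"):
    """Count individual values in a multi-value CSV column, most common first."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A missing or empty cell simply contributes nothing
        for value in (row.get(column) or "").split(separator):
            value = value.strip()
            if value:
                counts[value] += 1
    return counts.most_common()
```

- Unlike the shell pipeline this also strips surrounding whitespace from each value, so the counts could differ slightly if the data has stray spaces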
|
||||
|
||||
- Export a list of authors as well:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 3 GROUP BY "dc.contributor.author" ORDER BY count DESC) to /tmp/2023-12-08-authors.csv WITH CSV HEADER;
|
||||
COPY 102435
|
||||
```
|
||||
|
||||
## 2023-12-11
|
||||
|
||||
- Work on OpenRXV dependencies and podman a bit
|
||||
- Peter noticed that the statistics for this month are very very low on CGSpace
|
||||
- I don't know what is going on, perhaps it is related to me adjusting the nginx config last week?
|
||||
- Ah, it's probably because of the spider patterns I updated in 2023-11
|
||||
|
||||
## 2023-12-16
|
||||
|
||||
- Export CGSpace to check for missing Initiative collection mappings
|
||||
- Start a harvest on AReS
|
||||
|
||||
## 2023-12-17
|
||||
|
||||
- Pull latest master branch for OpenRXV and deploy on the server
|
||||
- I threw away some changes in the tree regarding the Angular base ref, and it broke AReS
|
||||
- So note to self: we need to set the base ref in `frontend/Dockerfile` before building!
|
||||
- Now Salem fixed the country map
|
||||
|
||||
## 2023-12-18
|
||||
|
||||
- Work a bit on the IFPRI-ISNAR archive from Leigh
|
||||
- More work on the DSpace 7 home page
|
||||
|
||||
## 2023-12-19
|
||||
|
||||
- More work on the DSpace 7 home page
|
||||
- The Alliance TIP team is testing deposits to the DSpace 7 REST API and getting an HTTP 500 error
|
||||
- In the DSpace logs I see this after they log in, create the item, and update the metadata:
|
||||
|
||||
```
|
||||
2023-12-19 17:49:28,022 ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
|
||||
```
|
||||
|
||||
- I found some messages on the dspace-tech mailing list suggesting this might be an old bug: https://groups.google.com/g/dspace-tech/c/My1GUFYFGoU/m/tS7-WAJPAwAJ
|
||||
- I restarted Tomcat and told the Alliance TIP team to try again
|
||||
|
||||
## 2023-12-20
|
||||
|
||||
- The Alliance guys said that submitting via REST works now... sigh, so that's just some old DSpace 5/6 REST API bug
|
||||
- I lowercased all our AGROVOC keywords in `dcterms.subject` in SQL:
|
||||
|
||||
```console
|
||||
dspace=# BEGIN;
|
||||
BEGIN
|
||||
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
|
||||
UPDATE 462
|
||||
dspace=*# COMMIT;
|
||||
COMMIT
|
||||
```
|
||||
|
||||
## 2023-12-25
|
||||
|
||||
- Looking into [Solr backups](https://solr.apache.org/guide/8_11/making-and-restoring-backups.html)
|
||||
- Since we are not running in Solr Cloud mode we need to use the replication endpoint for Solr standalone
|
||||
- This works:
|
||||
|
||||
```console
|
||||
$ curl 'http://localhost:8983/solr/statistics/replication?command=backup'
|
||||
{
|
||||
"responseHeader":{
|
||||
"status":0,
|
||||
"QTime":26},
|
||||
"status":"OK"}
|
||||
```
|
||||
|
||||
- Then I saw the size of the snapshot reach the size of the index...
|
||||
|
||||
```console
|
||||
# du -sh /var/solr/data/configsets/statistics/data/*
|
||||
22G /var/solr/data/configsets/statistics/data/index
|
||||
16G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
|
||||
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
|
||||
# du -sh /var/solr/data/configsets/statistics/data/*
|
||||
22G /var/solr/data/configsets/statistics/data/index
|
||||
20G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
|
||||
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
|
||||
# du -sh /var/solr/data/configsets/statistics/data/*
|
||||
22G /var/solr/data/configsets/statistics/data/index
|
||||
21G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
|
||||
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
|
||||
# du -sh /var/solr/data/configsets/statistics/data/*
|
||||
22G /var/solr/data/configsets/statistics/data/index
|
||||
22G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
|
||||
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
|
||||
```
|
||||
|
||||
- Then I deleted all documents in the core and restored from the snapshot backup:
|
||||
|
||||
```console
|
||||
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
|
||||
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<commit />'
|
||||
$ curl 'http://localhost:8983/solr/statistics/replication?command=restore&name=statistics'
|
||||
```
|
||||
|
||||
- Interestingly the restore worked fine, but created a new index directory:
|
||||
|
||||
```console
|
||||
# du -sh /var/solr/data/configsets/statistics/data/*
|
||||
4.0K /var/solr/data/configsets/statistics/data/index.properties
|
||||
22G /var/solr/data/configsets/statistics/data/restore.20231225154626463
|
||||
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
|
||||
22G /var/solr/data/configsets/statistics/data/snapshot.statistics
|
||||
```
|
||||
|
||||
- Not sure about the implications of that, but Solr uses the data just fine
|
||||
- I can surely use this for atomic Solr backups
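
- For scripting this later, a sketch of building the replication handler URLs (the helper name and `SOLR_URL` constant are my own; as far as I understand the Solr guide linked above, `name` on a backup creates a `snapshot.<name>` directory and `restorestatus` reports on a running restore, but I have only tested `backup` and `restore` here):

```python
from urllib.parse import urlencode

# Assumption: standalone Solr on the default local port, as in the curl calls above
SOLR_URL = "http://localhost:8983/solr"


def replication_url(core, command, **params):
    """Build a replication handler URL for a core, e.g. backup or restore."""
    query = urlencode({"command": command, **params})
    return f"{SOLR_URL}/{core}/replication?{query}"


# Hypothetical sequence for a named backup and a later restore
BACKUP = replication_url("statistics", "backup", name="statistics")
RESTORE = replication_url("statistics", "restore", name="statistics")
STATUS = replication_url("statistics", "restorestatus")
```

- The backup call returns immediately while the snapshot is still being written, as seen with the growing snapshot directory above, so a real script would need to wait for it to finish before relying on it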
|
||||
|
||||
## 2023-12-27
|
||||
|
||||
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
|
||||
- Do some other metadata cleanups on CGSpace
|
||||
- I also looked up our DOIs on Crossref to get some missing abstracts and correct licenses and dates
|
||||
- Some minor work on the CGSpace DSpace 7 theme to fix the navbar on mobile
|
||||
- Some work on the IFPRI ISNAR archive
|
||||
|
||||
## 2023-12-28
|
||||
|
||||
- I started porting the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to DSpace 7
|
||||
- Some work on the IFPRI ISNAR archive
|
||||
- I ended up going through most of the PDFs to get better dates and abstracts
|
||||
|
||||
## 2023-12-29
|
||||
|
||||
- I created a new Hetzner server to replace the current DSpace 6 CGSpace next week when we migrate to DSpace 7
|
||||
- Interesting, I haven't checked for content pointing to legacy domains in several years (!)
|
||||
- `inurl:mahider.cgiar.org`: 0 results on Google!
|
||||
- `inurl:mahider.ilri.org`: 2,100 results on Google
|
||||
- `inurl:mahider.ilri.org inurl:https`: 2 results on Google (!)
|
||||
- `inurl:dspace.ilri.org`: 1,390 results on Google
|
||||
- `inurl:dspace.ilri.org inurl:https`: 0 results on Google (!)
|
||||
- So it seems I can do away with the HTTPS virtual hosts finally
|
||||
- Well my current certificates expired on 2021-02-13 and nobody noticed... so...
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
430
content/posts/2024-01.md
Normal file
@@ -0,0 +1,430 @@
|
||||
---
|
||||
title: "January, 2024"
|
||||
date: 2024-01-02T10:08:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-01-02
|
||||
|
||||
- Work on preparation of new server for DSpace 7 migration
|
||||
- I'm not quite sure what we need to do for the Handle server
|
||||
- For now I just ran the `dspace make-handle-config` script and diffed it with the one from DSpace 6
|
||||
- I sent the bundle to the Handle admins to make sure it's OK before we do the migration
|
||||
- Continue testing and debugging the cgspace-java-helpers on DSpace 7
|
||||
- Work on IFPRI ISNAR archive cleanup
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2024-01-03
|
||||
|
||||
- I haven't heard from the Handle admins so I'm preparing a backup solution using nginx streams
|
||||
- This seems to work in my simple tests (this must be outside the `http {}` block):
|
||||
|
||||
```
|
||||
stream {
|
||||
upstream handle_tcp_9000 {
|
||||
server 188.34.177.10:9000;
|
||||
}
|
||||
|
||||
server {
|
||||
listen 9000;
|
||||
proxy_connect_timeout 1s;
|
||||
proxy_timeout 3s;
|
||||
proxy_pass handle_tcp_9000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- Here I forwarded a test TCP port 9000 from one server to another and was able to retrieve a test HTML that was running on the target
|
||||
- I will have to do TCP and UDP on port 2641, and TCP/HTTP on port 8000.
|
||||
- I did some more minor work on the IFPRI ISNAR archive
|
||||
- I got some PDFs from the UMN AgEcon search and fixed some metadata
|
||||
- Then I did some duplicate checking and found five items already on CGSpace
|
||||
|
||||
## 2024-01-04
|
||||
|
||||
- Upload 692 items for the ISNAR archive to CGSpace: https://cgspace.cgiar.org/handle/10568/136192
|
||||
- Help Peter proof and upload 252 items from the 2023 Gender conference to CGSpace
|
||||
- Meeting with IFPRI to discuss their migration to CGSpace
|
||||
- We agreed to add two new fields, one for IFPRI project and one for IFPRI publication ranking
|
||||
- Most likely we will use `cg.identifier.project` as a general field and consolidate other project fields there
|
||||
- Not sure which field to use for the publication rank...
|
||||
|
||||
## 2024-01-05
|
||||
|
||||
- Proof and upload 51 items in bulk for IFPRI
|
||||
- I did a big cleanup of user groups in anticipation of complaints about slow workflow tasks etc in DSpace 7
|
||||
- I removed ILRI editors from all the dozens of CCAFS community and collection groups, and I should do the same for other CRPs since they have been closed for two years now
|
||||
|
||||
## 2024-01-06
|
||||
|
||||
- Migrate CGSpace to DSpace 7
|
||||
|
||||
## 2024-01-07
|
||||
|
||||
- High load on the server and UptimeRobot saying the frontend is flapping
|
||||
- I noticed tons of logs from pm2 in the systemd journal, so I disabled those in the systemd unit because they are available from pm2's log directory anyway
|
||||
- I also noticed the same for Solr, so I disabled stdout for that systemd unit as well
|
||||
- I spent a lot of time bringing back the nginx rate limits we used in DSpace 6 and it seems to have helped
|
||||
- I see some client doing weird HEAD requests to search pages:
|
||||
|
||||
```
|
||||
47.76.35.19 - - [07/Jan/2024:00:00:02 +0100] "HEAD /search/?f.accessRights=Open+Access%2Cequals&f.actionArea=Resilient+Agrifood+Systems%2Cequals&f.author=Burkart%2C+Stefan%2Cequals&f.country=Kenya%2Cequals&f.impactArea=Climate+adaptation+and+mitigation%2Cequals&f.itemtype=Brief%2Cequals&f.publisher=CGIAR+System+Organization%2Cequals&f.region=Asia%2Cequals&f.sdg=SDG+12+-+Responsible+consumption+and+production%2Cequals&f.sponsorship=CGIAR+Trust+Fund%2Cequals&f.subject=environmental+factors%2Cequals&spc.page=1 HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.2504.63 Safari/537.36"
|
||||
```
|
||||
|
||||
- I will add their network blocks (AS45102) and regenerate my list of bot networks:
|
||||
|
||||
```console
|
||||
$ wget https://asn.ipinfo.app/api/text/list/AS16276 \
|
||||
https://asn.ipinfo.app/api/text/list/AS23576 \
|
||||
https://asn.ipinfo.app/api/text/list/AS24940 \
|
||||
https://asn.ipinfo.app/api/text/list/AS13238 \
|
||||
https://asn.ipinfo.app/api/text/list/AS14061 \
|
||||
https://asn.ipinfo.app/api/text/list/AS12876 \
|
||||
https://asn.ipinfo.app/api/text/list/AS55286 \
|
||||
https://asn.ipinfo.app/api/text/list/AS203020 \
|
||||
https://asn.ipinfo.app/api/text/list/AS204287 \
|
||||
https://asn.ipinfo.app/api/text/list/AS50245 \
|
||||
https://asn.ipinfo.app/api/text/list/AS6939 \
|
||||
https://asn.ipinfo.app/api/text/list/AS45102 \
|
||||
https://asn.ipinfo.app/api/text/list/AS21859
|
||||
$ cat AS* | sort | uniq | wc -l
|
||||
4897
|
||||
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
|
||||
$ wc -l /tmp/networks.txt
|
||||
2017 /tmp/networks.txt
|
||||
```
|
||||
|
||||
- I'm surprised to see the number of networks reduced from my current ones... hmmm.
|
||||
- I will also update my list of Bing networks:
|
||||
|
||||
```console
|
||||
$ ./ilri/bing-networks-to-ips.sh
|
||||
$ ~/go/bin/mapcidr -a < /tmp/bing-ips.txt > /tmp/bing-networks.txt
|
||||
$ wc -l /tmp/bing-networks.txt
|
||||
250 /tmp/bing-networks.txt
|
||||
```
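
- The aggregation step merges adjacent and overlapping prefixes, which is one reason the number of lines drops so much; a rough equivalent using Python's standard `ipaddress` module (this is not how `mapcidr` is implemented, and the function name is my own):

```python
import ipaddress


def aggregate_networks(cidrs):
    """Collapse a list of CIDR strings into the smallest covering set."""
    nets = [
        ipaddress.ip_network(cidr.strip(), strict=False)
        for cidr in cidrs
        if cidr.strip()
    ]
    # collapse_addresses() refuses mixed IPv4/IPv6 input, so handle each family
    v4 = [net for net in nets if net.version == 4]
    v6 = [net for net in nets if net.version == 6]
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in merged]
```

- For example, two adjacent /25 networks and a duplicate /24 covering them collapse into a single /24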
|
||||
|
||||
## 2024-01-08
|
||||
|
||||
- Export list of publishers for Peter to select a subset to use as a controlled vocabulary:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.publisher", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 178 GROUP BY "dcterms.publisher" ORDER BY count DESC) to /tmp/2024-01-publishers.csv WITH CSV HEADER;
|
||||
COPY 4332
|
||||
```
|
||||
|
||||
- Address some feedback on DSpace 7 from users, including filing some issues on GitHub
|
||||
- https://github.com/DSpace/dspace-angular/issues/2730: List of available metadata fields is truncated when adding new metadata in "Edit Item"
|
||||
- The Alliance TIP team was having issues posting to one collection via the legacy DSpace 6 REST API
|
||||
- In the DSpace logs I see the same issue that they had last month:
|
||||
|
||||
```
|
||||
ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
|
||||
```
|
||||
|
||||
## 2024-01-09
|
||||
|
||||
- I restarted Tomcat to see if it helps the REST issue
|
||||
- After talking with Peter about publishers we decided to get a clean list of the top ~100 publishers and then make sure all CGIAR centers, Initiatives, and Impact Platforms are there as well
|
||||
- I exported a list from PostgreSQL and then filtered by count > 40 in OpenRefine and then extracted the metadata values:
|
||||
|
||||
```
|
||||
$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
|
||||
```
|
||||
|
||||
- Export a list of ORCID identifiers from PostgreSQL to look them up on ORCID and update our controlled vocabulary:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
|
||||
localhost/dspace7= ☘ \q
|
||||
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-09-orcids.txt
|
||||
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
|
||||
```
|
||||
|
||||
- Then I updated existing ORCID identifiers in CGSpace:
|
||||
|
||||
```
|
||||
$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
|
||||
```
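
- The extraction step from the `grep -oE` pipeline above can be sketched in Python with the same pattern (the function name is my own, and the identifiers in any example input should be treated as made up):

```python
import re

# Same pattern as the grep above: four groups of four characters.
# The final character of an ORCID iD can be X, which [A-Z0-9] allows.
ORCID = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the sorted, de-duplicated ORCID-like identifiers in `text`."""
    return sorted(set(ORCID.findall(text)))
```

- Like the grep, this only checks the shape of the identifier, not whether it exists on ORCID or has a valid checksum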
|
||||
|
||||
- Bizu seems to be having issues due to belonging to too many groups
|
||||
- I see some messages from Solr in the DSpace log:
|
||||
|
||||
```
|
||||
2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
|
||||
```
|
||||
|
||||
- There are 1,805 OR clauses in the full log!
|
||||
- We previously had this issue in 2020-01 and 2020-02 with DSpace 5 and DSpace 6
|
||||
- At the time the solution was to increase the `maxBooleanClauses` in Solr and to disable access rights awareness, but I don't think we want to do the second one now
|
||||
- I saw many users of Solr in other applications increasing this to obscenely high numbers, so I think we should be OK to increase it from 1024 to 2048
|
||||
- Re-visiting the DSpace user groomer to delete inactive users
|
||||
- In 2023-08 I noticed that this was now [possible in DSpace 7](https://github.com/DSpace/DSpace/pull/2928)
|
||||
- As a test I tried to delete all users who have been inactive since six years ago (January 9, 2018):
|
||||
|
||||
```console
|
||||
$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
|
||||
```
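
- If this becomes a periodic task, the cutoff date could be computed instead of typed by hand; a small sketch, assuming the `-b` argument is MM/DD/YYYY as in the command above (the function name is my own):

```python
from datetime import date


def groomer_cutoff(today, years):
    """Return the date `years` before `today`, formatted as MM/DD/YYYY."""
    try:
        cutoff = today.replace(year=today.year - years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February
        cutoff = today.replace(year=today.year - years, day=28)
    return cutoff.strftime("%m/%d/%Y")
```

- With `date(2024, 1, 9)` and six years this gives the same `01/09/2018` used above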
|
||||
|
||||
- I tested it on DSpace 7 Test and it worked... I am debating running it on CGSpace...
|
||||
- I see we have almost 9,000 users:
|
||||
|
||||
```console
|
||||
$ dspace user -L > /tmp/users-before.txt
|
||||
$ wc -l /tmp/users-before.txt
|
||||
8943 /tmp/users-before.txt
|
||||
```
|
||||
|
||||
- I decided to do the same on CGSpace and it worked without errors
|
||||
- I finished working on the controlled vocabulary for publishers
|
||||
|
||||
## 2024-01-10
|
||||
|
||||
- I spent some time deleting old groups on CGSpace
|
||||
- I looked into the use of the `cg.identifier.ciatproject` field and found there are only a handful of uses, with some even seeming to be a mistake:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
|
||||
cg.identifier.ciatproject │ count
|
||||
───────────────────────────┼───────
|
||||
D145 │ 4
|
||||
LAM_LivestockPlus │ 2
|
||||
A215 │ 1
|
||||
A217 │ 1
|
||||
A220 │ 1
|
||||
A223 │ 1
|
||||
A224 │ 1
|
||||
A227 │ 1
|
||||
A229 │ 1
|
||||
A230 │ 1
|
||||
CLIMATE CHANGE MITIGATION │ 1
|
||||
LIVESTOCK │ 1
|
||||
(12 rows)
|
||||
|
||||
Time: 240.041 ms
|
||||
```
|
||||
|
||||
- I think we can move those to a new `cg.identifier.project` if we create one
|
||||
- The `cg.identifier.cpwfproject` field is similarly sparse, but the CCAFS ones are widely used
|
||||
|
||||
## 2024-01-12
|
||||
|
||||
- Export a list of affiliations to do some cleanup:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
|
||||
COPY 11719
|
||||
```
|
||||
|
||||
- I first did some clustering and editing in OpenRefine, then I'll import those back into CGSpace and then do another export
|
||||
- Troubleshooting the statistics pages that aren't working on DSpace 7
|
||||
- On a hunch, I queried for Solr statistics documents that **did not have an `id` matching the 36-character UUID pattern**:
|
||||
|
||||
```console
|
||||
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
|
||||
{
|
||||
"responseHeader":{
|
||||
"status":0,
|
||||
"QTime":0,
|
||||
"params":{
|
||||
"q":"-id:/.{36}/",
|
||||
"rows":"0"}},
|
||||
"response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
|
||||
}}
|
||||
```
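
The same query URL can be built programmatically instead of escaping it by hand. This is a minimal sketch, assuming the same local Solr URL as the curl command above; `build_missing_uuid_query` is a hypothetical helper name, not something from DSpace or Solr:

```python
from urllib.parse import urlencode

# Same endpoint as the curl command above
SOLR_SELECT = "http://localhost:8983/solr/statistics/select"


def build_missing_uuid_query(base_url=SOLR_SELECT, rows=0):
    """Return a select URL for documents whose id is not exactly 36 characters."""
    # The leading "-" negates the regular expression query on the id field
    params = {"q": "-id:/.{36}/", "rows": rows}
    # urlencode percent-encodes ":", "/", "{" and "}" for us
    return base_url + "?" + urlencode(params)


print(build_missing_uuid_query())
```

Letting `urlencode` handle the escaping avoids the backslashes that the shell needs around the curly braces.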
|
||||
|
||||
- They seem to come mostly from 2020, 2023, and 2024:
|
||||
|
||||
```console
|
||||
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
|
||||
{
|
||||
"responseHeader":{
|
||||
"status":0,
|
||||
"QTime":13,
|
||||
"params":{
|
||||
"facet.range":"time",
|
||||
"q":"-id:/.{36}/",
|
||||
"facet.range.gap":"+1YEAR",
|
||||
"rows":"0",
|
||||
"facet":"true",
|
||||
"facet.range.start":"2010-01-01T00:00:00Z",
|
||||
"facet.range.end":"NOW"}},
|
||||
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
|
||||
},
|
||||
"facet_counts":{
|
||||
"facet_queries":{},
|
||||
"facet_fields":{},
|
||||
"facet_ranges":{
|
||||
"time":{
|
||||
"counts":[
|
||||
"2010-01-01T00:00:00Z",0,
|
||||
"2011-01-01T00:00:00Z",0,
|
||||
"2012-01-01T00:00:00Z",0,
|
||||
"2013-01-01T00:00:00Z",0,
|
||||
"2014-01-01T00:00:00Z",0,
|
||||
"2015-01-01T00:00:00Z",89,
|
||||
"2016-01-01T00:00:00Z",11,
|
||||
"2017-01-01T00:00:00Z",0,
|
||||
"2018-01-01T00:00:00Z",0,
|
||||
"2019-01-01T00:00:00Z",0,
|
||||
"2020-01-01T00:00:00Z",1339,
|
||||
"2021-01-01T00:00:00Z",0,
|
||||
"2022-01-01T00:00:00Z",0,
|
||||
"2023-01-01T00:00:00Z",653736,
|
||||
"2024-01-01T00:00:00Z",144993],
|
||||
"gap":"+1YEAR",
|
||||
"start":"2010-01-01T00:00:00Z",
|
||||
"end":"2025-01-01T00:00:00Z"}},
|
||||
"facet_intervals":{},
|
||||
"facet_heatmaps":{}}}
|
||||
```
|
||||
|
||||
- They seem to come from 2023-08 until now (so way before we migrated to DSpace 7):
|
||||
|
||||
```console
|
||||
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
|
||||
{
|
||||
"responseHeader":{
|
||||
"status":0,
|
||||
"QTime":196,
|
||||
"params":{
|
||||
"facet.range":"time",
|
||||
"q":"-id:/.{36}/",
|
||||
"facet.range.gap":"+1MONTH",
|
||||
"rows":"0",
|
||||
"facet":"true",
|
||||
"facet.range.start":"2023-01-01T00:00:00Z",
|
||||
"facet.range.end":"NOW"}},
|
||||
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
|
||||
},
|
||||
"facet_counts":{
|
||||
"facet_queries":{},
|
||||
"facet_fields":{},
|
||||
"facet_ranges":{
|
||||
"time":{
|
||||
"counts":[
|
||||
"2023-01-01T00:00:00Z",1,
|
||||
"2023-02-01T00:00:00Z",0,
|
||||
"2023-03-01T00:00:00Z",0,
|
||||
"2023-04-01T00:00:00Z",0,
|
||||
"2023-05-01T00:00:00Z",0,
|
||||
"2023-06-01T00:00:00Z",0,
|
||||
"2023-07-01T00:00:00Z",0,
|
||||
"2023-08-01T00:00:00Z",27621,
|
||||
"2023-09-01T00:00:00Z",59165,
|
||||
"2023-10-01T00:00:00Z",115338,
|
||||
"2023-11-01T00:00:00Z",96147,
|
||||
"2023-12-01T00:00:00Z",355464,
|
||||
"2024-01-01T00:00:00Z",125429],
|
||||
"gap":"+1MONTH",
|
||||
"start":"2023-01-01T00:00:00Z",
|
||||
"end":"2024-02-01T00:00:00Z"}},
|
||||
"facet_intervals":{},
|
||||
"facet_heatmaps":{}}}
|
||||
```
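
Solr returns the range facet `counts` as one flat list that alternates labels and numbers, which is awkward to read. A small sketch of pairing them up, using values copied from the monthly response above; `pair_facet_counts` is a hypothetical helper name:

```python
def pair_facet_counts(flat_counts):
    """Turn a flat [label, count, label, count, ...] list into a dict."""
    labels = flat_counts[0::2]
    counts = flat_counts[1::2]
    return dict(zip(labels, counts))


# Non-zero months copied from the facet response above
monthly = [
    "2023-08-01T00:00:00Z", 27621,
    "2023-09-01T00:00:00Z", 59165,
    "2023-10-01T00:00:00Z", 115338,
    "2023-11-01T00:00:00Z", 96147,
    "2023-12-01T00:00:00Z", 355464,
    "2024-01-01T00:00:00Z", 125429,
]

by_month = pair_facet_counts(monthly)
busiest = max(by_month, key=by_month.get)
print(busiest, by_month[busiest])
```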
|
||||
|
||||
- I see that we had 31,744 statistic events yesterday, and 799 have no `id`!
|
||||
- I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
|
||||
- Several people said they have them, so it's a bug of some sort in DSpace, not our configuration
|
||||
|
||||
## 2024-01-13
|
||||
|
||||
- Yesterday alone we had 37,000 unique IPs making requests to nginx
|
||||
- I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14
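
Counting how many client IPs fall inside a network like this one can be sketched with Python's standard `ipaddress` module. The sample addresses below are made up for illustration, and `count_ips_in_network` is a hypothetical helper name:

```python
import ipaddress


def count_ips_in_network(ips, cidr):
    """Count how many of the given IP strings fall inside the CIDR block."""
    network = ipaddress.ip_network(cidr)
    return sum(1 for ip in ips if ipaddress.ip_address(ip) in network)


# Illustrative addresses only, not taken from the real logs
sample_ips = ["47.128.10.5", "47.131.255.254", "47.132.0.1", "192.0.2.10"]
print(count_ips_in_network(sample_ips, "47.128.0.0/14"))
```

A /14 starting at 47.128.0.0 covers 47.128.0.0 through 47.131.255.255, so the third sample address is just outside it.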
|
||||
|
||||
## 2024-01-15
|
||||
|
||||
- Investigating the CSS selector warning that I've seen in PM2 logs:
|
||||
|
||||
```console
|
||||
0|dspace-ui | 1 rules skipped due to selector errors:
|
||||
0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
|
||||
```
|
||||
|
||||
- It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is not invalid
|
||||
- But that led me to a more interesting issue with `inlineCritical` optimization for styles in Angular SSR that might be responsible for causing high load in the frontend
|
||||
- See: https://github.com/angular/angular/issues/42098
|
||||
- See: https://github.com/angular/universal/issues/2106
|
||||
- See: https://github.com/GoogleChromeLabs/critters/issues/78
|
||||
- Since the production site was flapping a lot I decided to try disabling inlineCriticalCss
|
||||
- There have been on and off load issues with the Angular frontend today
|
||||
- I think I will just block all data center network blocks for now
|
||||
- In the last week I see almost 200,000 unique IPs:
|
||||
|
||||
```console
|
||||
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u |
|
||||
tee /tmp/ips.txt | wc -l
|
||||
196493
|
||||
```
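
The `awk '{print $1}' | sort -u` part of that pipeline is easy to mirror in Python when I want to post-process the list. A rough sketch, assuming the client IP is the first whitespace-separated field of each access log line; the sample lines are invented:

```python
def unique_client_ips(log_lines):
    """Collect the first field of every non-empty line, de-duplicated and sorted."""
    ips = set()
    for line in log_lines:
        fields = line.split()
        if fields:
            ips.add(fields[0])
    return sorted(ips)


# Invented log lines using documentation-only IP ranges
sample_lines = [
    '192.0.2.1 - - [15/Jan/2024:10:00:00 +0100] "GET / HTTP/1.1" 200 512',
    '198.51.100.7 - - [15/Jan/2024:10:00:01 +0100] "GET /home HTTP/1.1" 200 1024',
    '192.0.2.1 - - [15/Jan/2024:10:00:02 +0100] "GET /search HTTP/1.1" 200 2048',
    "",
]
unique_ips = unique_client_ips(sample_lines)
print(len(unique_ips))
```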
|
||||
|
||||
- Looking these IPs up I see there are 18,000 coming from Comcast, 10,000 from AT&T, 4110 from Charter, 3500 from Cox and dozens of other residential IPs
|
||||
- I highly doubt these are home users browsing CGSpace... seems super fishy
|
||||
- Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
|
||||
- I will temporarily add a few new datacenter ISP network blocks to our rate limit:
|
||||
- 16509 Amazon-02
|
||||
- 701 UUNET
|
||||
- 8075 Microsoft
|
||||
- 15169 Google
|
||||
- 14618 Amazon-AES
|
||||
- 396982 Google Cloud
|
||||
- The load on the server *immediately* dropped
|
||||
|
||||
## 2024-01-17
|
||||
|
||||
- It turns out AS701 (UUNET) is Verizon Business, which is used as an ISP for many staff at IFPRI
|
||||
- This was causing them to see HTTP 429 "too many requests" errors on CGSpace
|
||||
- I removed this ASN from the rate limiting
|
||||
|
||||
## 2024-01-18
|
||||
|
||||
- Start looking at Solr stats again
|
||||
- I found one statistics record that has 22,000 of the same collection in `owningColl` and 22,000 of the same community in `owningComm`
|
||||
- The record is from 2015 and I think it would be easier to delete it than fix it:
|
||||
|
||||
```console
|
||||
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>uid:3b4eefba-a302-4172-a286-dcb25d70129e</query></delete>'
|
||||
```
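
If more of these need deleting, the XML body for the update handler can be generated rather than typed by hand. A minimal sketch that only builds the payload sent with `--data-binary` above (it does not contact Solr); `build_delete_by_query` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def build_delete_by_query(query):
    """Build the <delete><query>...</query></delete> body for Solr's update handler."""
    root = ET.Element("delete")
    # ElementTree escapes characters like & and < in the query text
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


payload = build_delete_by_query("uid:3b4eefba-a302-4172-a286-dcb25d70129e")
print(payload)
```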
|
||||
|
||||
- Looking again, there are at least 1,000 of these so I will need to come up with an actual solution to fix these
|
||||
- I'm noticing we have 1,800+ links to defunct resources on bioversityinternational.org in the `cg.link.permalink` field
|
||||
- I should ask Alliance if they have any plans to fix those, or upload them to CGSpace
|
||||
|
||||
## 2024-01-22
|
||||
|
||||
- Meeting with IWMI about ORCID integration on CGSpace now that we've migrated to DSpace 7
|
||||
- File an issue for the inaccurate DSpace statistics: https://github.com/DSpace/DSpace/issues/9275
|
||||
|
||||
## 2024-01-23
|
||||
|
||||
- Meeting with IWMI about ORCID integration and the DSpace API for use with WordPress
|
||||
- IFPRI sent me a list of their author ORCIDs to add to our controlled vocabulary
|
||||
- I joined them with our current list and resolved their names on ORCID and updated them in our database:
|
||||
|
||||
```console
|
||||
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI\ ORCiD\ All.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-23-orcids.txt
|
||||
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
|
||||
$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
|
||||
```
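
The `grep -oE ... | sort -u` step in the first command can be sketched in Python with the same pattern. The sample text and identifiers below are placeholders for illustration, and the vocabulary line format is an assumption:

```python
import re

# Same pattern as the grep -oE expression above
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the unique ORCID-shaped identifiers found in text, sorted."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Placeholder input mixing a vocabulary-style line and CSV-style lines
sample = """
<node id="Example Person: 0000-0002-1825-0097" label="Example Person: 0000-0002-1825-0097"/>
Example Person,https://orcid.org/0000-0002-1825-0097
Another Person,0000-0000-0000-000X
"""
orcids = extract_orcids(sample)
print(orcids)
```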
|
||||
|
||||
- This adds about 400 new identifiers to the controlled vocabulary
|
||||
- I consolidated our various project identifier fields for closed programs into one `cg.identifier.project`:
|
||||
- `cg.identifier.ccafsproject`
|
||||
- `cg.identifier.ccafsprojectpii`
|
||||
- `cg.identifier.ciatproject`
|
||||
- `cg.identifier.cpwfproject`
|
||||
- I prefixed the existing 2,644 metadata values with "CCAFS", "CIAT", or "CPWF" so we can figure out where they came from if need be, and deleted the old fields from the metadata registry
|
||||
|
||||
## 2024-01-26
|
||||
|
||||
- Minor work on dspace-angular to clean up component styles
|
||||
- Add `cg.identifier.publicationRank` to CGSpace metadata registry and submission form
|
||||
|
||||
## 2024-01-29
|
||||
|
||||
- Rework the nginx bot and network limits slightly to remove some old patterns/networks and remove Google
|
||||
- The Google Scholar team contacted me to ask why their requests were timing out (well...)
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
118
content/posts/2024-02.md
Normal file
@@ -0,0 +1,118 @@
|
||||
---
|
||||
title: "February, 2024"
|
||||
date: 2024-02-05T11:10:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-02-05
|
||||
|
||||
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
|
||||
- Lower case all the AGROVOC subjects on CGSpace
|
||||
|
||||
<!--more-->
|
||||
|
||||
```sql
|
||||
dspace=# BEGIN;
|
||||
BEGIN
|
||||
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
|
||||
UPDATE 180
|
||||
dspace=*# COMMIT;
|
||||
COMMIT
|
||||
```
|
||||
|
||||
## 2024-02-06
|
||||
|
||||
- Discuss IWMI using the CGSpace REST API for their new website
|
||||
- Export the IWMI community to extract their ORCID identifiers:
|
||||
|
||||
```console
|
||||
$ dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
|
||||
$ csvcut -c 'cg.creator.identifier,cg.creator.identifier[en_US]' ~/Downloads/2024-02-06-iwmi.csv \
|
||||
| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
|
||||
| sort -u \
|
||||
| tee /tmp/iwmi-orcids.txt \
|
||||
| wc -l
|
||||
353
|
||||
$ ./ilri/resolve_orcids.py -i /tmp/iwmi-orcids.txt -o /tmp/iwmi-orcids-names.csv -d
|
||||
```
|
||||
|
||||
- I noticed some similar looking names in our list so I clustered them in OpenRefine and manually checked a dozen or so to update our list
|
||||
|
||||
## 2024-02-07
|
||||
|
||||
- Maria asked me about the "missing" item from last week again
|
||||
- I can see it when I use the Admin search, but not in her workflow
|
||||
- It was submitted by TIP so I checked that user's workspace and found it there
|
||||
- After depositing, it went into the workflow so Maria should be able to see it now
|
||||
|
||||
## 2024-02-09
|
||||
|
||||
- Minor edits to CGSpace submission form
|
||||
- Upload 55 ISNAR book chapters to CGSpace from Peter
|
||||
|
||||
## 2024-02-19
|
||||
|
||||
- Looking into the collection mapping issue on CGSpace
|
||||
- It seems to be by design in DSpace 7: https://github.com/DSpace/dspace-angular/issues/1203
|
||||
- This is a massive setback for us...
|
||||
|
||||
## 2024-02-20
|
||||
|
||||
- Minor work on OpenRXV to fix a bug in the ng-select drop downs
|
||||
- Minor work on the DSpace 7 nginx configuration to allow requesting robots.txt and sitemaps without hitting rate limits
|
||||
|
||||
## 2024-02-21
|
||||
|
||||
- Minor updates on OpenRXV, including one bug fix for missing mapped collections
|
||||
- Salem had to re-work the harvester for DSpace 7 since the mapped collections and parent collection list are separate!
|
||||
|
||||
## 2024-02-22
|
||||
|
||||
- Discuss tagging of datasets and re-work the submission form to encourage use of DOI field for any item that has a DOI, and the normal URL field if not
|
||||
- The "cg.identifier.dataurl" field will be used for "related" datasets
|
||||
- I still have to check and move some metadata for existing datasets
|
||||
|
||||
## 2024-02-23
|
||||
|
||||
- This morning Tomcat died due to an OOM kill from the kernel:
|
||||
|
||||
```console
|
||||
kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 pgtables:20436kB oom_score_adj:0
|
||||
```
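
To compare OOM kills over time it helps to pull the memory figures out of the kernel line. A rough sketch that parses the line above with a regular expression; `parse_oom_kb` is a hypothetical helper name:

```python
import re

# The kernel message from above, split across lines for readability
OOM_LINE = (
    "kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, "
    "anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 "
    "pgtables:20436kB oom_score_adj:0"
)


def parse_oom_kb(line):
    """Extract every '<name>:<number>kB' pair from an OOM kill line."""
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


sizes = parse_oom_kb(OOM_LINE)
# Convert the resident anonymous memory from kB to GiB
print(round(sizes["anon-rss"] / 1024 / 1024, 2), "GiB resident")
```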
|
||||
|
||||
- I don't see any abnormal pattern in my Grafana graphs, for JVM or system load... very weird
|
||||
- I updated the submission form on CGSpace to include the new changes to URLs for datasets
|
||||
- I also updated about 80 datasets to move the URLs to the correct field
|
||||
|
||||
## 2024-02-25
|
||||
|
||||
- This morning Tomcat died while I was doing a CSV export, with an OOM kill from the kernel:
|
||||
|
||||
```console
|
||||
kernel: Out of memory: Killed process 720768 (java) total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0
|
||||
```
|
||||
|
||||
- I don't know why this is happening so often recently...
|
||||
|
||||
## 2024-02-27
|
||||
|
||||
- IFPRI sent me a list of authors to add to our list for now, until we can find a better way of doing it
|
||||
- I extracted the existing authors from our controlled vocabulary and combined them with IFPRI's:
|
||||
|
||||
```console
|
||||
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/dc-contributor-author.xml \
|
||||
| grep -oE 'label=".*"' \
|
||||
| sed -e 's/label="//' -e 's/"$//' > /tmp/authors
|
||||
$ cat /tmp/authors /tmp/ifpri-authors | sort -u > /tmp/new-authors
|
||||
```
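
The same extract-and-merge could be done without `grep` and `sed` by reading the `label` attributes directly. A minimal sketch, assuming the vocabulary file has the `node`/`isComposedBy` nesting implied by the XPath above; the sample XML and names are invented:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for dc-contributor-author.xml
SAMPLE_VOCABULARY = """<node id="dc-contributor-author" label="Authors">
  <isComposedBy>
    <node id="Example, Alice" label="Example, Alice"/>
    <node id="Sample, Bob" label="Sample, Bob"/>
  </isComposedBy>
</node>"""


def vocabulary_labels(xml_text):
    """Return the label of every node that sits inside an isComposedBy element."""
    root = ET.fromstring(xml_text)
    return [node.get("label") for node in root.findall(".//isComposedBy/node")]


def merge_authors(existing, new):
    """Combine two author lists, dropping duplicates, like sort -u."""
    return sorted(set(existing) | set(new))


merged = merge_authors(
    vocabulary_labels(SAMPLE_VOCABULARY),
    ["Sample, Bob", "Placeholder, Carol"],
)
print(merged)
```

Note that Python's `sorted` orders by code point, which can differ from the locale-aware ordering of `sort -u`.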
|
||||
|
||||
## 2024-02-28
|
||||
|
||||
- I figured out a way to add a new Angular component to handle all our relation fields
|
||||
|
||||
## 2024-02-29
|
||||
|
||||
- Clean up a bunch of metadata on CGSpace
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
207
content/posts/2024-03.md
Normal file
@@ -0,0 +1,207 @@
|
||||
---
|
||||
title: "March, 2024"
|
||||
date: 2024-03-01T09:55:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-03-01
|
||||
|
||||
- Last week Bizu reported an issue with the "browse by issue date" drop down
|
||||
- I verified it, and suspect it could be due to missing issue dates...
|
||||
- It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
|
||||
|
||||
<!--more-->
|
||||
|
||||
- I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable
|
||||
- I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846
|
||||
|
||||
## 2024-03-03
|
||||
|
||||
- I did some cleanups on abstracts, licenses, and dates from CrossRef
|
||||
- I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list
|
||||
|
||||
## 2024-03-05
|
||||
|
||||
- I tried a new technique to get some affiliations from Crossref using OpenRefine
|
||||
- First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
|
||||
- Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
|
||||
- Then I joined them with our affiliations, paying no attention to duplicates
|
||||
- Then I deduped them using the Jython technique I learned in 2023-02
|
||||
|
||||
## 2024-03-06
|
||||
|
||||
- Peter sent me some more corrections for the authors that I had sent him in 2023-12
|
||||
|
||||
## 2024-03-08
|
||||
|
||||
- IFPRI sent me their 2023 records from CONTENTdm so I started working on those
|
||||
- I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:
|
||||
|
||||
```python
|
||||
import re

with open(r"/tmp/cg-creator-identifier.txt", 'r') as f:
    orcid_ids = [orcid_id.strip() for orcid_id in f]

matched = False
for orcid_id in orcid_ids:
    if re.search(r'.+: {}'.format(value), orcid_id):
        matched = True
        break

if matched:
    return orcid_id
else:
    return value
|
||||
```
|
||||
|
||||
|
||||
- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:
|
||||
|
||||
```sql
|
||||
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');
|
||||
```
|
||||
|
||||
- Note the use of two single quotes to escape the one in the name
|
||||
|
||||
## 2024-03-11
|
||||
|
||||
- Experimenting with moving some of my Python scripts to the DSpace 7 REST API
|
||||
- I need a way to get UUIDs for Handles...
|
||||
- Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864
|
||||
- Then just take the first result...?
|
||||
- I spent some time working on the script to get abstracts from CGSpace, and found a bug in my logic
|
||||
- I also noticed that one item had two abstracts, but the first one was blank!
|
||||
- Looking deeper, I found 113 blank metadata values so I deleted those:
|
||||
|
||||
```sql
|
||||
BEGIN;
|
||||
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
- I also found a few dozen items with "N/A" for their citation, so I deleted those too:
|
||||
|
||||
```sql
|
||||
BEGIN;
|
||||
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
|
||||
- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:
|
||||
|
||||
```console
|
||||
$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \
|
||||
| csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv
|
||||
```
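
Going back to the Handle-to-UUID lookup from the start of this entry, the Discovery approach could look roughly like this. It is only a sketch: the nesting of the JSON response (`_embedded.searchResult._embedded.objects[].​_embedded.indexableObject`) is my assumption about the DSpace 7 REST API and should be checked against a real response, and the helper names and the UUID in the sample are hypothetical:

```python
from urllib.parse import urlencode


def discovery_url(server, handle):
    """Build a Discovery search URL that looks up one item by its Handle."""
    params = {"dsoType": "item", "query": "handle:" + handle}
    return server.rstrip("/") + "/api/discover/search/objects?" + urlencode(params)


def first_uuid(response):
    """Return the UUID of the first search hit, or None when there are no hits."""
    objects = (
        response.get("_embedded", {})
        .get("searchResult", {})
        .get("_embedded", {})
        .get("objects", [])
    )
    if not objects:
        return None
    # Assumed layout: each hit embeds the item as "indexableObject"
    return objects[0]["_embedded"]["indexableObject"]["uuid"]


url = discovery_url("https://dspace7test.ilri.org/server", "10568/130864")
print(url)

# Hand-written stand-in for a parsed JSON response
sample_response = {
    "_embedded": {
        "searchResult": {
            "_embedded": {
                "objects": [
                    {"_embedded": {"indexableObject": {"uuid": "00000000-0000-0000-0000-000000000000"}}}
                ]
            }
        }
    }
}
print(first_uuid(sample_response))
```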
|
||||
|
||||
## 2024-03-12
|
||||
|
||||
- Run the duplicate checker for IFPRI 2023 batch upload
|
||||
|
||||
## 2024-03-13
|
||||
|
||||
- I found about 428 duplicates in the IFPRI 2023 batch records
|
||||
- Alarmingly, I found about 18 that are duplicated on CGSpace as well!
|
||||
- I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
|
||||
- Alliance asked me to get him the Handles for items submitted by TIP that are not discoverable
|
||||
- I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:
|
||||
|
||||
```sql
|
||||
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
|
||||
```
|
||||
|
||||
## 2024-03-14
|
||||
|
||||
- Looking into reports of rate limiting of Altmetric's bot on CGSpace
|
||||
- I don't see any HTTP 429 responses for their user agents in any of our logs...
|
||||
- I tried myself on an item page and never hit a limit...
|
||||
|
||||
```console
|
||||
$ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done
|
||||
Request 1: 200
|
||||
Request 2: 200
|
||||
Request 3: 200
|
||||
Request 4: 200
|
||||
...
|
||||
Request 60: 200
|
||||
```
|
||||
|
||||
- All responses were HTTP 200...
|
||||
- In any case, I whitelisted their production IPs and told them to try again
|
||||
- I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace
|
||||
- I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace
|
||||
- This was a bit of dirty work using csvkit, xsv, and OpenRefine
|
||||
|
||||
## 2024-03-17
|
||||
|
||||
- There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace
|
||||
- These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration
|
||||
- I looked closer and whittled this down to 14 actual records, and spent some time working on them
|
||||
- I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links
|
||||
- Now there only remain two confusing records about the Inkomati catchment
|
||||
|
||||
## 2024-03-18
|
||||
|
||||
- Checking to see how many IFPRI records we have migrated so far:
|
||||
|
||||
```console
|
||||
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \
|
||||
| csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \
|
||||
| tee /tmp/ifpri-records.csv \
|
||||
| csvstat --count
|
||||
898
|
||||
```
|
||||
|
||||
- I finalized the remaining two on Inkomati catchment and now we are at 900!
|
||||
|
||||
## 2024-03-19
|
||||
|
||||
- IWMI sent me some new author ORCID identifiers so I updated our list
|
||||
- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC
|
||||
- First extracting all unique subjects on CGSpace:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;
|
||||
COPY 28024
|
||||
```
|
||||
|
||||
- Then I extracted the subjects and looked them up against AGROVOC:
|
||||
|
||||
```console
|
||||
$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt
|
||||
$ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv
|
||||
```
|
||||
|
||||
## 2024-03-20
|
||||
|
||||
- Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace
|
||||
|
||||
## 2024-03-21
|
||||
|
||||
- Look more closely at duplicates on CGSpace based on a fresh export
|
||||
- Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates
|
||||
- I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original
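
Finding the DOIs that occur more than once in an export is a simple counting job. A minimal sketch, assuming the DOI column has already been read into a list of strings; the sample values use a made-up DOI prefix and `repeated_dois` is a hypothetical helper name:

```python
from collections import Counter


def repeated_dois(dois):
    """Return a dict of DOI -> occurrences for every DOI seen more than once."""
    # Ignore blank cells and surrounding whitespace
    counts = Counter(doi.strip() for doi in dois if doi and doi.strip())
    return {doi: n for doi, n in counts.items() if n > 1}


# Made-up values for illustration
sample = [
    "https://doi.org/10.1234/abc",
    "https://doi.org/10.1234/abc ",
    "https://doi.org/10.1234/xyz",
    "",
]
duplicates = repeated_dois(sample)
print(duplicates)
```

If every repeated DOI appears exactly twice, the number of actual duplicates is half the number of matching rows, which is where the rough estimate of 400 comes from.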
|
||||
|
||||
## 2024-03-22
|
||||
|
||||
- Look at duplicate DOIs on CGSpace and address a dozen or so
|
||||
|
||||
## 2024-03-23
|
||||
|
||||
- Look at duplicate DOIs on CGSpace and address a dozen or so
|
||||
- Update Tomcat and Solr to latest versions
|
||||
- I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked
|
||||
|
||||
## 2024-03-24
|
||||
|
||||
- Slowly process several dozen more duplicate DOIs on CGSpace, sigh...
|
||||
|
||||
## 2024-03-26
|
||||
|
||||
- File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880
|
||||
- Merge metadata and withdraw more duplicates on CGSpace
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
169
content/posts/2024-04.md
Normal file
@@ -0,0 +1,169 @@
|
||||
---
|
||||
title: "April, 2024"
|
||||
date: 2024-04-04T10:23:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-04-04
|
||||
|
||||
- Work on CGSpace duplicate DOIs more
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2024-04-08
|
||||
|
||||
- Start working on IFPRI's 2022 batch import
|
||||
- I ran the duplicate checker against CGSpace and started downloading all linked PDFs
|
||||
|
||||
## 2024-04-09
|
||||
|
||||
- Continue working on IFPRI's 2022 batch import
|
||||
- I started validating the potential duplicates in OpenRefine
|
||||
|
||||
## 2024-04-12
|
||||
|
||||
- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then upload them
|
||||
- I need to merge the metadata for the remaining 212 that are already on CGSpace
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-04-13
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-04-14
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-04-15
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
|
||||
- Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:
|
||||
|
||||
```console
|
||||
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
|
||||
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
|
||||
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
|
||||
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
|
||||
```
|
||||
|
||||
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
|
||||
- I merged metadata with the existing items
|
||||
- There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself
|
||||
|
||||
## 2024-04-16
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:
|
||||
|
||||
```
|
||||
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
|
||||
```
|
||||
|
||||
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
|
||||
- I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:
|
||||
|
||||
```python
|
||||
import urllib2
|
||||
|
||||
doi = cells['cg.identifier.doi[en_US]'].value
|
||||
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
|
||||
useragent = "Python (mailto:a.o@cgiar.org)"
|
||||
|
||||
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
|
||||
get = urllib2.urlopen(request)
|
||||
|
||||
return get.read().decode('utf-8')
|
||||
```
|
||||
|
||||
- It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!
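
Outside OpenRefine the same request can be written for Python 3, where `urllib2` became `urllib.request`. This is a sketch under the assumption that the DOI is passed in the same form as the cell value used above; the function names and the example address are hypothetical, and nothing is fetched unless `fetch_citation` is called:

```python
import urllib.request


def citation_request(doi, mailto):
    """Build the request for a formatted citation, using the same URL as above."""
    url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
    headers = {"User-Agent": "Python (mailto:{})".format(mailto)}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi, mailto):
    """Perform the request and return the citation text (needs network access)."""
    with urllib.request.urlopen(citation_request(doi, mailto)) as response:
        return response.read().decode("utf-8")


# Only build the request here; the DOI below is a placeholder
request = citation_request("10.1234/example", "user@example.org")
print(request.full_url)
```

In Python 3 the response body is bytes and strings are Unicode by default, so the explicit `decode` is the only encoding step needed.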
|
||||
|
||||
## 2024-04-18
|
||||
|
||||
- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:
|
||||
|
||||
```sql
|
||||
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
|
||||
```
|
||||
|
||||
- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:
|
||||
|
||||
```sql
|
||||
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
|
||||
```
|
||||
|
||||
- Then I can work that list into an nginx map with redirect, for example:
|
||||
|
||||
```console
|
||||
server {
|
||||
...
|
||||
|
||||
if ($new_uri) {
|
||||
return 301 $new_uri;
|
||||
}
|
||||
}
|
||||
|
||||
map $request_uri $new_uri {
|
||||
/handle/10568/112821 /handle/10568/97605;
|
||||
}
|
||||
```
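
Turning the SQL results into that `map` block can be scripted. A rough sketch, assuming each row is a pair of bare Handles (from, to) like the example above; if `dcterms.replaces` holds full URLs they would need trimming first, and `nginx_map_entries` is a hypothetical helper name:

```python
def nginx_map_entries(redirects):
    """Render (handle_from, handle_to) pairs as an nginx map block."""
    lines = ["map $request_uri $new_uri {"]
    for handle_from, handle_to in redirects:
        lines.append("    /handle/{} /handle/{};".format(handle_from, handle_to))
    lines.append("}")
    return "\n".join(lines)


# The pair from the example above
map_block = nginx_map_entries([("10568/112821", "10568/97605")])
print(map_block)
```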
|
||||
|
||||
## 2024-04-19
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
- Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary
|
||||
|
||||
## 2024-04-20
|
||||
|
||||
- I read an [interesting thread about DOI casing](https://github.com/greenelab/scihub/issues/9)
|
||||
- Apparently the DOI specification says ASCII characters in DOIs are case insensitive
|
||||
- Indeed, [Crossref recommends lower case](https://www.crossref.org/documentation/member-setup/constructing-your-dois/) for all DOIs
|
||||
- I was curious about the DOIs in our database so I checked before and after lower casing:
|
||||
|
||||
```console
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
|
||||
COPY 25675
|
||||
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
|
||||
COPY 25666
|
||||
```
|
||||
|
||||
- I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata...
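
- The before and after comparison above can be sketched in Python as well; the DOIs below are made up for illustration and the function name is hypothetical:

```python
def count_case_collisions(dois):
    """Return the number of distinct DOIs before and after lower casing."""
    before = set(dois)
    after = {doi.lower() for doi in before}
    return len(before), len(after)


# Made-up sample values, two of which differ only by case
sample_dois = [
    "https://doi.org/10.1016/J.AGSY.2020.102873",
    "https://doi.org/10.1016/j.agsy.2020.102873",
    "https://doi.org/10.1371/journal.pone.0000001",
]
before, after = count_case_collisions(sample_dois)
print(before, after)
```

- The difference between the two counts is the number of values that only differ by case.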
|
||||
|
||||
## 2024-04-23
|
||||
|
||||
- Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
|
||||
- The workflow curation tasks are not documented very well but I got a basic configuration working
|
||||
- I found a bug in DSpace curation tasks and discussed on Slack
|
||||
- I finalized the `NormalizeDOIs` curation task and released v7.6.1.1 of the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) project
|
||||
|
||||
## 2024-04-24
|
||||
|
||||
- A bit more testing of the curation tasks
|
||||
- I tested a patch by Mark Wood
|
||||
- I added support for normalizing DOIs to this same format to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) project
|
||||
|
||||
## 2024-04-25
|
||||
|
||||
- I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-04-26
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-04-29
|
||||
|
||||
- Start working on the IFPRI 2020–2021 batch migration
|
||||
- I modified my `check_duplicates.py` script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact
|
||||
- I noticed something in the Tomcat log:
|
||||
|
||||
```console
|
||||
tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Women’s Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
|
||||
tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [’] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
|
||||
```
|
||||
|
||||
- I found the bitstream's ID and then used the `ds6_bitstream2itemhandle` [SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the item's handle
|
||||
- Then I replaced the curly quote with a regular quote in all bitstreams
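
- A small sketch of how such file names could be detected before Tomcat rejects them, based on the permitted range of 0 to 255 mentioned in the log message (the helper name is hypothetical):

```python
def non_latin1_chars(filename):
    """Return the characters whose code point is outside 0 to 255."""
    return [ch for ch in filename if ord(ch) > 255]


name = "Literature review on Women\u2019s Empowerment and their Resilience2.pdf"
bad = non_latin1_chars(name)

# Replace the curly quote (U+2019) with a regular ASCII apostrophe
fixed = name.replace("\u2019", "'")
print(bad, non_latin1_chars(fixed))
```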
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
197
content/posts/2024-05.md
Normal file
@@ -0,0 +1,197 @@
|
||||
---
|
||||
title: "May, 2024"
|
||||
date: 2024-05-01T10:39:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-05-01
|
||||
|
||||
- I dumped all the CGSpace DOIs and resolved them with my `crossref_doi_lookup.py` script
|
||||
- Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2024-05-05
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-05-06
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-05-07
|
||||
|
||||
- Discuss RSS feeds and OpenSearch with IWMI
|
||||
- It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch
|
||||
- I saw a patch for an interesting issue on DSpace GitHub: [Error submitting or deleting items - URI too long when user is in a large number of groups](https://github.com/DSpace/DSpace/issues/9544)
|
||||
- I hadn't realized it, but we have lots of those errors:
|
||||
|
||||
```console
|
||||
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
|
||||
1423
|
||||
```
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
|
||||
## 2024-05-08
|
||||
|
||||
- Spend some time looking at duplicate DOIs again...
|
||||
- I finally finished looking at the duplicate DOIs for journal articles
|
||||
- I updated the list of handle redirects and there are 386 of them!
|
||||
|
||||
## 2024-05-09
|
||||
|
||||
- Spend some time working on the IFPRI 2020–2021 batch
|
||||
- I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
|
||||
|
||||
## 2024-05-12
|
||||
|
||||
- I couldn't figure out how to do a complex join on withdrawn items along with their metadata, so I pull out a few like titles, handles, and provenance separately:
|
||||
|
||||
```psql
|
||||
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
|
||||
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
|
||||
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
|
||||
```
|
||||
|
||||
- Then joined them:
|
||||
|
||||
```console
|
||||
$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
|
||||
```
|
||||
|
||||
- This gives me an insight into who submitted 334 of the duplicates over the past few years...
|
||||
- I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ff, fi, fl, ffi, and ffl
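
- The join on `uuid` above could also be sketched with Python's standard library. This assumes one row per `uuid` in each file (an item with several matching values would keep only the last one here, unlike `csvjoin`), and the sample data is invented:

```python
import csv
import io


def read_by_uuid(handle):
    """Index CSV rows by their uuid column (keeps the last row per uuid)."""
    return {row["uuid"]: row for row in csv.DictReader(handle)}


def inner_join(*tables):
    """Merge the rows whose uuid appears in every table."""
    common = set(tables[0]).intersection(*tables[1:])
    joined = []
    for uuid in sorted(common):
        merged = {}
        for table in tables:
            merged.update(table[uuid])
        joined.append(merged)
    return joined


# Invented sample data standing in for the three exported files
titles = read_by_uuid(io.StringIO("uuid,title\na1,Some title\nb2,Other title\n"))
handles = read_by_uuid(io.StringIO("uuid,uri\na1,https://hdl.handle.net/10568/1\n"))
submitters = read_by_uuid(io.StringIO("uuid,submitted_by\na1,Submitted by Jane Doe\n"))

rows = inner_join(titles, handles, submitters)
print(rows)
```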
|
||||
|
||||
## 2024-05-13
|
||||
|
||||
- Export a list of IFPRI information products with handle links and CONTENTdm links:
|
||||
|
||||
```
|
||||
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
|
||||
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
|
||||
| tee /tmp/ifpri-redirects.csv \
|
||||
| csvstat --count
|
||||
2645
|
||||
```
|
||||
|
||||
- I discovered the `/server/api/pid/find` endpoint today, which is much more direct and manageable than the `/server/api/discover/search/objects?query=` endpoint when trying to get metadata for a Handle (item, collection, or community)
|
||||
- The "pid" stands for permanent identifiers apparently, and we can use it like this:
|
||||
|
||||
```
|
||||
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
|
||||
```
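
- A minimal sketch for building that URL from a handle; the function name is hypothetical and the actual HTTP request is left out so the example does not depend on the server being reachable:

```python
from urllib.parse import urlencode


def pid_find_url(base_url, handle):
    """Build the /server/api/pid/find URL for a handle."""
    # Keep the slash in the handle readable instead of encoding it as %2F
    return f"{base_url}/server/api/pid/find?" + urlencode({"id": handle}, safe="/")


url = pid_find_url("https://dspace7test.ilri.org", "10568/118424")
print(url)
```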
|
||||
|
||||
## 2024-05-15
|
||||
|
||||
- I got journal titles for 2,900 journal articles that were missing them from Crossref
|
||||
|
||||
## 2024-05-16
|
||||
|
||||
Helping IFPRI with some DSpace 7 API support, these are two queries for items issued in 2024:
|
||||
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024
|
||||
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D (note that the part after the field name is the URL-encoded version of the Lucene range syntax `:[2024-01-01T00:00:00Z TO *]`)
|
||||
|
||||
Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
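
The URL-encoded form of the date range query can be produced with Python's standard library, as a small sketch (the variable names are only for illustration):

```python
from urllib.parse import quote

BASE = "https://dspace7test.ilri.org/server/api/discover/search/objects?query="

# Lucene range query on the Solr date field
lucene_query = "dcterms.issued_dt:[2024-01-01T00:00:00Z TO *]"

# safe="" so that ":", "[", "]", "*" and spaces are all percent-encoded
encoded = quote(lucene_query, safe="")
url = BASE + encoded
print(url)
```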
|
||||
|
||||
I wrote a new version of the `check_duplicates.py` script to help identify duplicates with different types
|
||||
- Initially I called it `check_duplicates_fast.py` but it's actually not faster
|
||||
- I need to find a way to deal with duplicates from IFPRI's repository because there are some mismatched types...
|
||||
|
||||
## 2024-05-20
|
||||
|
||||
Continue working through alternative duplicate matching for IFPRI
|
||||
- Their item types are sometimes different than ours...
|
||||
- The default similarity threshold in my script is 0.6, but I rarely see legitimate duplicates with a similarity that low, so I might increase it to 0.7 to reduce the number of items I have to check
|
||||
- Also, the maximum difference in issue dates is currently 365 days, but I should reduce that a bit, perhaps to 270 days (9 months)
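
- To illustrate how the two thresholds interact, here is a rough sketch. It uses `difflib.SequenceMatcher` as a stand-in similarity measure, which is an assumption and not necessarily what `check_duplicates.py` uses, and the function name and titles are hypothetical:

```python
from datetime import date
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.7  # proposed increase from 0.6
MAX_DAYS_APART = 270  # proposed decrease from 365


def is_candidate_duplicate(title_a, date_a, title_b, date_b):
    """Flag a pair when titles are similar and issue dates are close."""
    similarity = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    days_apart = abs((date_a - date_b).days)
    return similarity >= SIMILARITY_THRESHOLD and days_apart <= MAX_DAYS_APART


title = "Climate change and maize yields in Kenya"
close = is_candidate_duplicate(title, date(2020, 1, 1), title, date(2020, 4, 1))
far = is_candidate_duplicate(title, date(2020, 1, 1), title, date(2021, 6, 1))
print(close, far)
```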
|
||||
|
||||
## 2024-05-22
|
||||
|
||||
- Finalize and upload the IFPRI 2020–2021 batch set
|
||||
- I used a new technique to get missing licenses via Crossref (it's Python 2 because of OpenRefine's Jython):
|
||||
|
||||
```python
|
||||
import urllib2
|
||||
|
||||
doi = cells['cg.identifier.doi[en_US]'].value
|
||||
url = "https://api.crossref.org/works/" + doi
|
||||
useragent = "Python (mailto:a.o@cgiar.org)"
|
||||
|
||||
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
|
||||
get = urllib2.urlopen(request)
|
||||
|
||||
return get.read().decode('utf-8')
|
||||
```
|
||||
|
||||
## 2024-05-23
|
||||
|
||||
- Finalize last of the duplicates I found for the IFPRI 2020–2021 batch set (those that we missed initially due to mismatched types)
|
||||
- Export a new list of IFPRI redirects from CONTENTdm:
|
||||
|
||||
```console
|
||||
$ csvgrep -c 'dc.description.provenance[en_US]' -r 'Original URLs? from IFPRI CONTENTdm' cgspace.csv \
|
||||
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
|
||||
| tee /tmp/ifpri-redirects.csv \
|
||||
| csvstat --count
|
||||
4004
|
||||
```
|
||||
|
||||
I found a way to get abstracts from PLOS
|
||||
- They offer an API that returns XML including the JATS-formatted abstracts
|
||||
- I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:
|
||||
|
||||
```console
|
||||
"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'
|
||||
```
|
||||
|
||||
Then used `value.parseXml()` on the resulting text to extract the abstract's text:
|
||||
|
||||
```console
|
||||
value.parseXml().select("abstract")[0].xmlText()
|
||||
```
|
||||
|
||||
This doesn't preserve `<p>` tags though...
|
||||
- Oh, nice, this does!
|
||||
|
||||
```console
|
||||
forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")
|
||||
```
|
||||
|
||||
For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines...
|
||||
- Ah, some articles have multiple abstracts, for example: https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript
|
||||
- I need to select the abstract that does **not** have any attributes (using [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html))
|
||||
|
||||
```console
|
||||
forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")
|
||||
```
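
The same selection (the abstract without attributes, paragraphs joined by blank lines) can be sketched outside OpenRefine with Python's standard library. The sample XML is a made-up fragment and the sketch assumes the document has no XML namespaces:

```python
import xml.etree.ElementTree as ET


def extract_abstract(xml_text):
    """Return the paragraphs of the first abstract that has no attributes."""
    root = ET.fromstring(xml_text)
    for abstract in root.iter("abstract"):
        if abstract.attrib:
            # Skip typed abstracts such as abstract-type="summary"
            continue
        paragraphs = ["".join(p.itertext()).strip() for p in abstract.iter("p")]
        return "\r\n\r\n".join(paragraphs)
    return None


# Made-up fragment with two abstracts, only one of them without attributes
sample = """<article><front><article-meta>
<abstract abstract-type="summary"><p>Author summary.</p></abstract>
<abstract><p>First paragraph.</p><p>Second <italic>paragraph</italic>.</p></abstract>
</article-meta></front></article>"""

abstract_text = extract_abstract(sample)
print(abstract_text)
```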
|
||||
|
||||
Testing `xsv` (Rust) versus `csvkit` (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:
|
||||
|
||||
```console
|
||||
$ time xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv | xsv select doi | xsv count
|
||||
27339
|
||||
xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv 0.06s user 0.03s system 98% cpu 0.091 total
|
||||
xsv select doi 0.02s user 0.02s system 40% cpu 0.091 total
|
||||
xsv count 0.01s user 0.00s system 9% cpu 0.090 total
|
||||
$ time csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
|
||||
27339
|
||||
csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv 1.15s user 0.06s system 95% cpu 1.273 total
|
||||
csvcut -c doi 0.42s user 0.05s system 36% cpu 1.283 total
|
||||
csvstat --count 0.20s user 0.03s system 18% cpu 1.298 total
|
||||
```
|
||||
|
||||
## 2024-05-27
|
||||
|
||||
- Working on IFPRI datasets batch migration
|
||||
- 732 items total
|
||||
- 6 duplicates on CGSpace
|
||||
- 6 duplicates within set that need investigation
|
||||
|
||||
## 2024-05-28
|
||||
|
||||
- I'm thinking of increasing the frequency of thumbnail generation on CGSpace
|
||||
- Currently the `dspace filter-media` script runs once at 3AM for all media types and seems to take ~10 minutes to run for all 118,000 items...
|
||||
- I think I will make the thumbnailer run explicitly more often using `-p "ImageMagick PDF Thumbnail"`
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
119
content/posts/2024-06.md
Normal file
@@ -0,0 +1,119 @@
|
||||
---
|
||||
title: "June, 2024"
|
||||
date: 2024-06-03T14:14:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-06-03
|
||||
|
||||
- Working on IFPRI datasets
|
||||
- I noticed the licenses were missing from Nilam's original file so I found a way to check [Dataverse's API for a persistent identifier](https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats)
|
||||
- We have both Handles and DOIs for these datasets, both from Harvard's Dataverse
|
||||
|
||||
<!--more-->
|
||||
|
||||
- I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):
|
||||
|
||||
```
|
||||
"https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:" + value.split('https://doi.org/')[-1].toUppercase()
|
||||
```
|
||||
|
||||
- Then I was able to extract the license text from the JSON response using:
|
||||
|
||||
```
|
||||
value.parseJson()['datasetVersion']['termsOfUse']
|
||||
```
|
||||
|
||||
- Similar for the Handle...
|
||||
|
||||
## 2024-06-04
|
||||
|
||||
- Some Dataverse entries have the license in `['datasetVersion']['license']` instead...
|
||||
- I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace
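
- A sketch of checking both license locations noted above in one pass. The key names come from the GREL expressions, but the handling of a nested license object with a `name` key is an assumption about the response shape, and the sample responses are invented:

```python
import json


def extract_license(response_text):
    """Return the license from termsOfUse, falling back to license."""
    version = json.loads(response_text).get("datasetVersion", {})
    for key in ("termsOfUse", "license"):
        value = version.get(key)
        if isinstance(value, dict):
            # Assumed shape: the license may be an object with a "name" key
            value = value.get("name")
        if value:
            return value
    return None


# Invented sample responses
with_terms = json.dumps({"datasetVersion": {"termsOfUse": "CC0 Waiver"}})
with_license = json.dumps({"datasetVersion": {"license": {"name": "CC BY 4.0"}}})
print(extract_license(with_terms), extract_license(with_license))
```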
|
||||
|
||||
## 2024-06-14
|
||||
|
||||
- Minor cleanups on IFPRI's 2016–2019 batch migration file
|
||||
- I will start with duplicates on unique identifiers like DOIs
|
||||
|
||||
## 2024-06-18
|
||||
|
||||
- Merge and upload metadata for duplicates in IFPRI's 2016–2019 set:
|
||||
- 144 exact matches on CGSpace via DOI, type, and date
|
||||
- 32 with CGSpace handles
|
||||
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
|
||||
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
|
||||
|
||||
## 2024-06-19
|
||||
|
||||
- Spent some time checking the remaining 3,312 items in the IFPRI 2016–2019 migration set for duplicates on CGSpace
|
||||
- There seem to be about 50 exact matches of title, type, and issue date
|
||||
|
||||
## 2024-06-20
|
||||
|
||||
- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
|
||||
- Heavy load on both CGSpace and DSpace 7 Test this afternoon
|
||||
- Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
|
||||
- The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
|
||||
|
||||
```
|
||||
0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
|
||||
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
|
||||
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
|
||||
```
|
||||
|
||||
- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
|
||||
- I changed the nginx logging to use the `X-Forwarded-For` header, because the default `combined` log format uses `$remote_addr`, which is only accurate if the request doesn't come from Angular (i.e. directly to the API)
|
||||
- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
|
||||
- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
|
||||
- For now I will just add those to the list of bot networks
|
||||
|
||||
## 2024-06-21
|
||||
|
||||
- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
|
||||
- I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
|
||||
- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
|
||||
- Then I updated the list of bot networks:
|
||||
|
||||
```console
|
||||
$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
|
||||
https://asn.ipinfo.app/api/text/list/AS132203 \
|
||||
https://asn.ipinfo.app/api/text/list/AS13238 \
|
||||
https://asn.ipinfo.app/api/text/list/AS136907 \
|
||||
https://asn.ipinfo.app/api/text/list/AS14061 \
|
||||
https://asn.ipinfo.app/api/text/list/AS14618 \
|
||||
https://asn.ipinfo.app/api/text/list/AS16276 \
|
||||
https://asn.ipinfo.app/api/text/list/AS16509 \
|
||||
https://asn.ipinfo.app/api/text/list/AS203020 \
|
||||
https://asn.ipinfo.app/api/text/list/AS204287 \
|
||||
https://asn.ipinfo.app/api/text/list/AS21859 \
|
||||
https://asn.ipinfo.app/api/text/list/AS23576 \
|
||||
https://asn.ipinfo.app/api/text/list/AS24940 \
|
||||
https://asn.ipinfo.app/api/text/list/AS396982 \
|
||||
https://asn.ipinfo.app/api/text/list/AS45102 \
|
||||
https://asn.ipinfo.app/api/text/list/AS50245 \
|
||||
https://asn.ipinfo.app/api/text/list/AS55286 \
|
||||
https://asn.ipinfo.app/api/text/list/AS6939 \
|
||||
https://asn.ipinfo.app/api/text/list/AS8075
|
||||
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
|
||||
$ wc -l /tmp/networks.txt
|
||||
8675 /tmp/networks.txt
|
||||
```
|
||||
|
||||
- Update list of ORCID identifiers with new ones from Alliance and IFPRI
|
||||
- Finalize uploading the remaining 3,264 items from IFPRI's 2016–2019 batch migration to CGSpace
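
- The aggregation step for the bot networks above can be approximated with Python's `ipaddress` module. This is only a sketch of merging adjacent or overlapping networks, not a claim about how `mapcidr` works internally, and the sample networks are documentation ranges:

```python
import ipaddress


def aggregate_networks(cidrs):
    """Collapse adjacent or overlapping networks into the fewest CIDRs."""
    networks = [ipaddress.ip_network(c.strip()) for c in cidrs if c.strip()]
    # collapse_addresses() cannot mix IPv4 and IPv6, so handle them separately
    v4 = [n for n in networks if n.version == 4]
    v6 = [n for n in networks if n.version == 6]
    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(n) for n in collapsed]


sample = [
    "192.0.2.0/25",
    "192.0.2.128/25",
    "198.51.100.0/24",
    "2001:db8::/33",
    "2001:db8:8000::/33",
]
aggregated = aggregate_networks(sample)
print(aggregated)
```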
|
||||
|
||||
## 2024-06-24
|
||||
|
||||
- Minor updates to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) and [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to normalize a few more invalid DOI formats
|
||||
|
||||
## 2024-06-25
|
||||
|
||||
- Work on uploading some missing PDFs from the IFPRI 2016–2019 batch migration
|
||||
|
||||
## 2024-06-26
|
||||
|
||||
- Did a big cleanup of several thousand journal articles based on metadata from Crossref
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
57
content/posts/2024-07.md
Normal file
@@ -0,0 +1,57 @@
|
||||
---
|
||||
title: "July, 2024"
|
||||
date: 2024-07-01T09:37:00+03:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-07-01
|
||||
|
||||
- A bit of work to clean up duplicate DOIs on CGSpace
|
||||
- A handful of book chapters, working papers, and journal articles using the wrong DOI
|
||||
- I tried to delete all users who have been inactive since six years ago (July 1, 2018):
|
||||
|
||||
<!--more-->
|
||||
|
||||
```console
|
||||
$ dspace dsrun org.dspace.eperson.Groomer -a -b 07/01/2018 -d
|
||||
```
|
||||
|
||||
- File an issue on DSpace GitHub: [Allow configuring disallowed domains for self registration](https://github.com/DSpace/DSpace/issues/9675)
|
||||
|
||||
## 2024-07-11
|
||||
|
||||
- Minor fixes to normalize the IFPRI CONTENTdm URLs in provenance fields:
|
||||
|
||||
```console
|
||||
dspace=# BEGIN;
|
||||
BEGIN
|
||||
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'cdm/ref', 'digital') WHERE text_value LIKE '%CONTENTdm%cdm/ref/%';
|
||||
UPDATE 1876
|
||||
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'CONTENTdm: ', 'CONTENTdm: ') WHERE text_value LIKE '%CONTENTdm: %';
|
||||
UPDATE 21
|
||||
dspace=*# COMMIT;
|
||||
COMMIT
|
||||
```
|
||||
|
||||
- Then export a new list of CONTENTdm redirects, excluding withdrawn items:
|
||||
|
||||
```console
|
||||
dspace= ☘ \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
|
||||
COPY 8568
|
||||
```
|
||||
|
||||
- Similarly, get a list of withdrawn item redirects:
|
||||
|
||||
```console
|
||||
dspace= ☘ \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
|
||||
COPY 396
|
||||
```
|
||||
|
||||
## 2024-07-18
|
||||
|
||||
- I experimented with adding a regular expression to validate DOIs to the submission form
|
||||
- It is a slightly modified version of the one found here: https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
|
||||
- I decided it will probably be confusing to people and will have limited benefit, since we are normalizing most forms of DOIs to our preferred form after submission anyway
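
- For reference, a validation check of this kind might look like the sketch below. The pattern is a simplified illustration written for this example, not the regular expression from the Stack Overflow thread or the one I tested, and it assumes the preferred form is a `https://doi.org/` URL:

```python
import re

# Simplified illustration: https://doi.org/ + "10." + registrant code + "/" + suffix
DOI_URL_PATTERN = re.compile(r"^https://doi\.org/10\.\d{4,9}/\S+$")


def is_valid_doi(value):
    """Check whether a value looks like a DOI in https://doi.org/ form."""
    return bool(DOI_URL_PATTERN.match(value))


print(is_valid_doi("https://doi.org/10.1371/journal.pntd.0001859"))
print(is_valid_doi("doi:10.1371/journal.pntd.0001859"))
```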
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
71
content/posts/2024-08.md
Normal file
@@ -0,0 +1,71 @@
|
||||
---
|
||||
title: "August, 2024"
|
||||
date: 2024-08-08T23:07:00-07:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-08-08
|
||||
|
||||
- While working on the CGIAR Climate Change Synthesis I learned some new tricks with OpenRefine
|
||||
|
||||
<!--more-->
|
||||
|
||||
- The first was to retrieve affiliations from OpenAlex and extract them from JSON with this GREL:
|
||||
|
||||
```
|
||||
forEach(
|
||||
value.parseJson()['authorships'],
|
||||
a,
|
||||
forEach(
|
||||
a.parseJson()['institutions'],
|
||||
i,
|
||||
i['display_name']
|
||||
).join("||")
|
||||
).join("||")
|
||||
```
|
||||
|
||||
- It is a nested `forEach` to extract all institutions for all authors
|
||||
- Second was a better way to deduplicate lists in Jython while preserving list order:
|
||||
|
||||
```python
|
||||
# better dedupe preserves order
|
||||
seen = set()
|
||||
deduped_list = [x for x in value.split("||") if x not in seen and not seen.add(x)]
|
||||
|
||||
return "||".join(deduped_list)
|
||||
```
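
- Both tricks can be combined in plain Python 3 outside OpenRefine, as a sketch. The `authorships`, `institutions`, and `display_name` keys come from the GREL above, while the function name and sample JSON are invented; `dict.fromkeys` relies on Python 3.7+ insertion ordering, so it is not a drop-in for Jython:

```python
import json


def institutions_from_work(work_json):
    """Extract unique institution names for all authors, in first-seen order."""
    work = json.loads(work_json)
    names = [
        institution["display_name"]
        for authorship in work.get("authorships", [])
        for institution in authorship.get("institutions", [])
    ]
    # dict.fromkeys keeps the first occurrence of each name, in order
    return "||".join(dict.fromkeys(names))


# Invented sample shaped like the fields used in the GREL expression
sample = json.dumps({
    "authorships": [
        {"institutions": [{"display_name": "Institute A"}, {"display_name": "Institute B"}]},
        {"institutions": [{"display_name": "Institute A"}]},
    ]
})
affiliations = institutions_from_work(sample)
print(affiliations)
```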
|
||||
|
||||
## 2024-08-20
|
||||
|
||||
- Delete duplicate metadata values using the method I described in this GitHub issue: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
|
||||
|
||||
## 2024-08-22
|
||||
|
||||
- Help IWMI with some OpenSearch RSS/Atom feeds for search results:
|
||||
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:flooding
|
||||
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:drought
|
||||
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:landslides
|
||||
|
||||
- Export list of withdrawn handle redirects:
|
||||
|
||||
```
|
||||
dspace=# \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
|
||||
COPY 400
|
||||
```
|
||||
|
||||
- Export list of IFPRI CONTENTdm redirects:
|
||||
|
||||
```
|
||||
dspace-# \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
|
||||
COPY 10794
|
||||
```
|
||||
|
||||
- I filed [an issue](https://github.com/DSpace/dspace-angular/issues/3258) on DSpace Angular for anonymous users to be able to export search results to CSV
|
||||
|
||||
## 2024-08-26
|
||||
|
||||
- Spent some time trying to rebase our DSpace Angular themes on top of the massive header/navbar rework from [DSpace 7.6.2](https://github.com/DSpace/dspace-angular/pull/2858)
|
||||
- Spent some time getting missing bibliographic metadata (issue dates, licenses, pages, volume, issue, publisher, etc) from Crossref for CGSpace
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
147
content/posts/2024-09.md
Normal file
@@ -0,0 +1,147 @@
|
||||
---
|
||||
title: "September, 2024"
|
||||
date: 2024-09-01T21:16:00-07:00
|
||||
author: "Alan Orth"
|
||||
categories: ["Notes"]
|
||||
---
|
||||
|
||||
## 2024-09-01
|
||||
|
||||
- Upgrade CGSpace to DSpace 7.6.2
|
||||
|
||||
<!--more-->
|
||||
|
||||
## 2024-09-05
|
||||
|
||||
- Finalize work on migrating DSpace Angular from Yarn to NPM
|
||||
|
||||
## 2024-09-06
|
||||
|
||||
- This morning Tomcat crashed due to an OOM kill:
|
||||
|
||||
```
|
||||
Sep 06 00:00:24 server systemd[1]: tomcat9.service: A process of this unit has been killed by the OOM killer.
|
||||
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Main process exited, code=killed, status=9/KILL
|
||||
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Failed with result 'oom-kill'.
|
||||
```
|
||||
|
||||
- According to the system journal, it was a Node.js dspace-angular process that tried to allocate memory and failed, thus invoking the OOM killer
|
||||
- Currently I see high memory usage in those processes:
|
||||
|
||||
```console
|
||||
$ pm2 status
|
||||
┌────┬──────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
|
||||
│ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │ cpu │ mem │ user │ watching │
|
||||
├────┼──────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
|
||||
│ 0 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 994 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
|
||||
│ 1 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1015 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
|
||||
│ 2 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1029 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
|
||||
│ 3 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1042 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
|
||||
└────┴──────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
|
||||
```
|
||||
|
||||
- I bet if I look in the logs I'd find some kind of heavy traffic on the frontend, causing high caching for Angular SSR
|
||||
|
||||
## 2024-09-08
|
||||
|
||||
- Analyzing memory use in our DSpace hosts, which have 32GB of memory
|
||||
- Effective cache of PostgreSQL is estimated at 11GB, which seems way high since the database is only 2GB
|
||||
- Realistically this should be how we adjust, with PostgreSQL using ~8GB (or less) and each dspace-angular process pinned at 2GB...
|
||||
|
||||
> Total − Solr − Tomcat − Postgres − Nginx − Angular
> 31366 − (1024×4.4) − 7168 − (8×1024) − 512 − (4×2048) = 2796.4 left...
|
||||
|
||||
- I put some of these changes in on DSpace Test and will monitor this week
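
- The memory budget above, written out as a small calculation in megabytes (the dictionary keys are just labels for the terms in the formula):

```python
TOTAL_MB = 31366

# Planned allocations in MB, in the same order as the formula
budget_mb = {
    "solr": 1024 * 4.4,
    "tomcat": 7168,
    "postgresql": 8 * 1024,
    "nginx": 512,
    "angular": 4 * 2048,  # four dspace-angular processes at 2 GB each
}

remaining_mb = TOTAL_MB - sum(budget_mb.values())
print(f"{remaining_mb:.1f} MB left")
```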
|
||||
|
||||
## 2024-09-10
|
||||
|
||||
- Some bot in South Africa made a ton of requests on the API and made the load hit the roof:
|
||||
|
||||
```
|
||||
# grep -E '10/Sep/2024:[10-11]' /var/log/nginx/api-access.log | awk '{print $1}' | sort | uniq -c | sort -h
|
||||
...
|
||||
149720 102.182.38.90
|
||||
```
|
||||
|
||||
- They are using several user agents so are obviously a bot:
|
||||
|
||||
```
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0
|
||||
Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
|
||||
```
|
||||
|
||||
- I added them to the list of bot networks in nginx and the load went down
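
- The same per-client count could be sketched in Python. This assumes the access log starts each line with the client IP followed by a bracketed timestamp, and that the intent was to match hours 10 and 11 (written here as `1[01]`, since a bracket expression like `[10-11]` matches a single character); the sample lines are invented:

```python
import re
from collections import Counter

# Requests between 10:00 and 11:59 on 10 September 2024
HOUR_PATTERN = re.compile(r"\[10/Sep/2024:1[01]:")


def top_clients(lines, n=5):
    """Count requests per client IP for lines in the selected hours."""
    counts = Counter()
    for line in lines:
        if HOUR_PATTERN.search(line):
            counts[line.split(" ", 1)[0]] += 1
    return counts.most_common(n)


# Invented sample lines using documentation IP ranges
sample = [
    '203.0.113.7 - - [10/Sep/2024:10:15:32 +0200] "GET /server/api HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Sep/2024:11:01:02 +0200] "GET /server/api HTTP/1.1" 200 512',
    '198.51.100.9 - - [10/Sep/2024:10:20:00 +0200] "GET /server/api HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Sep/2024:14:00:00 +0200] "GET /server/api HTTP/1.1" 200 512',
]
clients = top_clients(sample)
print(clients)
```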
|
||||
|
||||
## 2024-09-11
|
||||
|
||||
- Upgrade DSpace 7 Test to Ubuntu 24.04
|
||||
- I did some minor maintenance to test dspace-statistics-api with Python 3.12
|
||||
- I tagged version 1.4.4 and released it on GitHub
|
||||
|
||||
## 2024-09-14
|
||||
|
||||
- Noticed a persistent higher than usual load on CGSpace and checked the server logs
|
||||
- Found some new data center subnets to block because they were making thousands of requests with normal user agents
|
||||
- I enabled HTTP/3 in nginx
|
||||
- I enabled the SSR patch in Angular: https://github.com/DSpace/dspace-angular/issues/3110
|
||||
|
||||
## 2024-09-16
|
||||
|
||||
- Experiment with the [dspace-statistics-api-js](https://github.com/codeobia/dspace-statistics-api-js) on DSpace 7 Test
|
||||
- In the past it always caused Solr to run out of memory, but I increased Solr's heap from 2g to 3g and it runs without crashing
|
||||
- I attached VisualVM to Solr with a 3g and 4g heap and iterated over 1260 pages of results in the dspace-statistics-api-js:
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
## 2024-09-23
|
||||
|
||||
- Upgrade PostgreSQL from version 14 to 15 on DSpace Test the same way I did last year:
|
||||
|
||||
```console
|
||||
# apt update
|
||||
# apt install postgresql-15
|
||||
# Update configs with Ansible
|
||||
# systemctl stop tomcat9
|
||||
# pg_ctlcluster 14 main stop
|
||||
# tar -cvzpf var-lib-postgresql-14.tar.gz /var/lib/postgresql/14
|
||||
# tar -cvzpf etc-postgresql-14.tar.gz /etc/postgresql/14
|
||||
# pg_ctlcluster 15 main stop
|
||||
# pg_dropcluster 15 main
|
||||
# pg_upgradecluster 14 main
|
||||
# pg_ctlcluster 15 main start
|
||||
...
|
||||
|
||||
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
|
||||
ERROR: function public.xml_is_well_formed(text) does not exist
|
||||
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
|
||||
ERROR: function public.xml_valid(text) does not exist
|
||||
```
|
||||
|
||||
|
||||
- After that I [re-indexed the database indexes](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/) using a query:
|
||||
|
||||
```console
|
||||
$ su - postgres
|
||||
$ cat /tmp/generate-reindex.sql
|
||||
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
|
||||
FROM pg_class C
|
||||
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
|
||||
WHERE nspname = 'public'
|
||||
AND C.relkind = 'r'
|
||||
AND nspname !~ '^pg_toast'
|
||||
ORDER BY pg_total_relation_size(C.oid) ASC;
|
||||
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
|
||||
$ <trim the extra stuff from /tmp/reindex.sql>
|
||||
$ psql dspace < /tmp/reindex.sql
|
||||
```
|
||||
|
||||
- The database shrank by 186MB!
|
||||
|
||||
## 2024-09-29

- I upgraded the database on CGSpace to PostgreSQL 15

<!-- vim: set sw=2 ts=2: -->
82
content/posts/2024-10.md
Normal file
@@ -0,0 +1,82 @@
---
title: "October, 2024"
date: 2024-10-03T11:01:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2024-10-03

- I had an idea to get abstracts from OpenAlex
- For [copyright reasons they don't include plain abstracts](https://docs.openalex.org/api-entities/works/work-object#abstract_inverted_index), but the [pyalex](https://github.com/J535D165/pyalex) library can convert them on the fly

<!--more-->

- I filtered for journal articles that were Creative Commons and missing abstracts:

```console
$ csvcut -c 'id,dc.title[en_US],dcterms.abstract[en_US],cg.identifier.doi[en_US],dcterms.type[en_US],dcterms.language[en_US],dcterms.license[en_US]' ~/Downloads/2024-09-30-cgspace.csv | csvgrep -c 'dcterms.type[en_US]' -r '^Journal Article$' | csvgrep -c 'cg.identifier.doi[en_US]' -r '^.+$' | csvgrep -c 'dcterms.license[en_US]' -r '^CC-' | csvgrep -c 'dcterms.abstract[en_US]' -r '^$' | csvgrep -c 'dcterms.language[en_US]' -r '^en$' | grep -v "||" | grep -v -- '-ND' | grep -v -E 'https://doi.org/10.(2499|4160|17528)/' > /tmp/missing-abstracts.csv
```

- Then I wrote a script to get them from OpenAlex
- After inspecting and cleaning up a few dozen in OpenRefine (removing "Keywords:" prefixes, copyright statements, HTML entities, etc.) I managed to get about 440 abstracts
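- A minimal sketch of how an inverted index like the one OpenAlex publishes can be turned back into plain text, assuming the index is a mapping of each word to the list of positions where it occurs (the function name and the sample data are hypothetical, and this is not the script mentioned above nor pyalex's own code):

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a word -> [positions] mapping.

    Returns None when the index is empty or missing.
    """
    if not inverted_index:
        return None

    # Invert the mapping: position -> word
    words_by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word

    # Join the words in positional order
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


# Hypothetical sample data, only to show the shape of the index
sample = {"Open": [0], "access": [1, 4], "improves": [2], "equitable": [3]}
print(abstract_from_inverted_index(sample))
```

- The rebuilt text still needs the same manual cleanup described above, since the index only stores the words and their order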
## 2024-10-06

- Since I increased Solr's heap from 2G to 3G a few weeks ago, Solr seems to be using 100% CPU all the time
- I don't understand this because it was running well before, and I only increased it in anticipation of running the dspace-statistics-api-js, though I never got around to it
- I just realized that this may be related to the JMX monitoring, as I've seen gaps in the Grafana dashboards and remember that it took surprisingly long to scrape the metrics
- Maybe I need to change the scrape interval
## 2024-10-08

- I checked the VictoriaMetrics vmagent dashboard and saw that there were thousands of errors scraping the `jvm_solr` target from Solr
- So it seems like I do need to change the scrape interval
- I will increase it from 15s (global) to 20s for that job
- Reading some documentation I found [this reference from Brian Brazil that discusses this very problem](https://www.robustperception.io/keep-it-simple-scrape_interval-id/)
- He recommends keeping a single scrape interval for all targets, but also checking the slow exporter (`jmx_exporter` in this case) and seeing if we can limit the data we scrape
- To keep things simple for now I will increase the global scrape interval to 20s
- Long term I should limit the metrics...
- Oh wow, I found out that [Solr ships with a Prometheus exporter!](https://solr.apache.org/guide/8_11/monitoring-solr-with-prometheus-and-grafana.html) and even includes a Grafana dashboard
- I'm trying to run the Solr prometheus-exporter as a one-off systemd unit to test it:

```console
# cd /opt/solr-8.11.3/contrib/prometheus-exporter
# systemd-run --uid=victoriametrics --gid=victoriametrics --working-directory=/opt/solr-8.11.3/contrib/prometheus-exporter ./bin/solr-exporter -p 9854 -b http://localhost:8983/solr -f ./conf/solr-exporter-config.xml -s 20
```

- The exporter's default scrape interval (`-s`) is 60 seconds, so if we scrape it more often than that the metrics will be stale
- From what I've seen this returns in less than one second so it should be safe to reduce the scrape interval
## 2024-10-19

- Heavy load on CGSpace today
- There is a noticeable increase just before 4PM local time
- I extracted a list of IPs:

```console
# grep -E '19/Oct/2024:1[567]' /var/log/nginx/api-access.log | awk '{print $1}' | sort -u > /tmp/ips.txt
```

- I looked them up and found some data center networks using normal user agents across hundreds of IPs, for example:
  - 154.47.29.168 # 212238 (CDNEXT - Datacamp Limited, GB)
  - 91.210.64.12 # 29802 (HVC-AS, US) - HIVELOCITY, Inc.
  - 103.221.57.120 # 132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
  - 109.107.150.136 # 201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
  - 185.210.207.1 # 209709 (CODE200-ISP1 - UAB code200, LT)
  - 185.162.119.101 # 207223 (GLOBALCON - Global Connections Network LLC, US)
  - 173.244.35.101 # 64286 (LOGICWEB, US) - Tesonet
  - 139.28.160.141 # 396319 (US-INTERNET-396319, US) - OxyLabs
  - 104.143.89.112 # 62874 (WEB2OBJECTS, US) - Web2Objects LLC
- I added some network blocks to the nginx conf
- Interestingly, I see a lot of IPs using the same user agent today:

```console
# grep "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.3" /var/log/nginx/api-access.log | awk '{print $1}' | sort -u | wc -l
767
```

- For reference, the current Chrome version is 129 or so...
- This is definitely worth looking into because it seems like one massive botnet
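- To find other user agents shared by suspiciously many IPs without grepping for each one, something like this sketch could work, assuming the access log has the client IP as the first field and the user agent as the last quoted field (the regular expression and function names are hypothetical and would need adjusting to the real nginx log format):

```python
import re
from collections import defaultdict

# Assumed layout: IP first, user agent as the final quoted field on the line
LOG_LINE = re.compile(r'^(\S+) .* "([^"]*)"$')


def ips_per_user_agent(lines):
    """Return {user_agent: number of distinct client IPs}."""
    ips_by_agent = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            ip, user_agent = match.groups()
            ips_by_agent[user_agent].add(ip)
    return {agent: len(ips) for agent, ips in ips_by_agent.items()}


def most_shared_agents(lines, limit=10):
    """Return the user agents seen from the most distinct IPs, largest first."""
    counts = ips_per_user_agent(lines)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    import sys

    # Usage: python3 shared_agents.py < /var/log/nginx/api-access.log
    for agent, count in most_shared_agents(sys.stdin):
        print(count, agent)
```

- A user agent that is both outdated and shared by hundreds of IPs in unrelated networks is a reasonable candidate for closer inspection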
<!-- vim: set sw=2 ts=2: -->
50
content/posts/2024-11.md
Normal file
@@ -0,0 +1,50 @@
---
title: "November, 2024"
date: 2024-11-11T09:47:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2024-11-11

- Some IP in India is making tons of requests this morning with a normal user agent:

```console
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
```

<!--more-->

- They are using this user agent:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
```
## 2024-11-16

- I switched CGSpace to Node.js v20 since I've been using it in dev and test for months
## 2024-11-18

- I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc
- Google also publishes its IP ranges: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- Our nginx config doesn't rate limit the API but perhaps that needs to change...
- In DSpace 4/5/6 the API was separate from the user interface, so we didn't enforce rate limits there because we encouraged using it over scraping the UI
- In DSpace 7 the API is used by the frontend and perhaps should have the same IP- and UA-based rate limiting
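- Since Google publishes its ranges, a claimed Googlebot can be checked against them; a minimal sketch, assuming the ranges have already been loaded into a list of CIDR strings (the list below uses documentation-only placeholder networks, not Google's real ranges, and the function names are hypothetical):

```python
import ipaddress

# Placeholder networks reserved for documentation; the real list would come
# from the ranges Google publishes at the URL above
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_networks(ip, cidrs):
    """Return True if the address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


def looks_spoofed(ip, user_agent, cidrs):
    """Flag requests that claim to be a Google crawler from an unlisted IP."""
    claims_google = "Googlebot" in user_agent or "GoogleOther" in user_agent
    return claims_google and not ip_in_networks(ip, cidrs)


print(looks_spoofed("188.34.177.10", "Mozilla/5.0 (compatible; Googlebot/2.1)", PLACEHOLDER_RANGES))
```

- With the real ranges loaded, any request that claims a Google user agent from outside them could be rate limited or blocked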
## 2024-11-19

- I noticed 10,000 requests from a new bot yesterday:

```
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
```

- Seems to be some kind of PHP framework library
- Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
- 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents

<!-- vim: set sw=2 ts=2: -->
28
content/posts/2024-12.md
Normal file
@@ -0,0 +1,28 @@
---
title: "December, 2024"
date: 2024-12-04T10:19:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2024-12-04

- We need to get view and download statistics for the last year from CGSpace
- The only way to get that is using Solr

<!--more-->

- After consulting the [Solr documentation](https://solr.apache.org/guide/8_11/working-with-dates.html) I came up with this facet query:

> facet.range=time&facet.range.start=NOW/MONTH-11MONTHS&facet.range.end=NOW/MONTH+1MONTH&facet.range.gap=+1MONTH

- [This StackOverflow answer](https://stackoverflow.com/questions/34290600/how-to-apply-facet-on-date-field-where-result-should-provide-number-of-records-f) helped too, recommending `NOW/MONTH` to get neatly bucketed months because this will use the beginning of the current month
- For views, I added the following query parameters: `q=type:2&fq=-isBot:true AND statistics_type:view`

> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview&indent=true&q.op=OR&q=type%3A2&rows=0

- For downloads I added the following query parameters: `q=type:0&fq=-isBot:true AND statistics_type:view AND bundleName:ORIGINAL`

> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview%20AND%20bundleName%3AORIGINAL&indent=true&q.op=OR&q=type%3A0&rows=0
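- The percent-encoding in those URLs is easy to get wrong by hand (the `+` in `+1MONTH` must become `%2B`), so here is a sketch that builds them with the standard library, assuming the same Solr core URL as above (the function name is hypothetical):

```python
from urllib.parse import quote, urlencode

SOLR_SELECT = "http://localhost:8983/solr/statistics/select"


def monthly_facet_url(q, fq, base=SOLR_SELECT):
    """Build a Solr URL that facets the last twelve months by month."""
    params = {
        "q": q,
        "fq": fq,
        "rows": 0,
        "facet": "true",
        "facet.range": "time",
        "facet.range.start": "NOW/MONTH-11MONTHS",
        "facet.range.end": "NOW/MONTH+1MONTH",
        "facet.range.gap": "+1MONTH",
    }
    # quote_via=quote encodes spaces as %20 and, with urlencode's default
    # safe="", also encodes "/" and "+" in the date math expressions
    return f"{base}?{urlencode(params, quote_via=quote)}"


views = monthly_facet_url("type:2", "-isBot:true AND statistics_type:view")
downloads = monthly_facet_url(
    "type:0", "-isBot:true AND statistics_type:view AND bundleName:ORIGINAL"
)
print(views)
print(downloads)
```

- The parameter order differs from the URLs above, which should not matter to Solr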
<!-- vim: set sw=2 ts=2: -->
38
content/posts/2025-01.md
Normal file
@@ -0,0 +1,38 @@
---
title: "January, 2025"
date: 2025-01-03T11:09:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2025-01-03

- Trying to get search results for a large boolean query given to me by some researchers
- When searching via the Angular frontend I see an error in the Tomcat logs:

<!--more-->

```
Jan 03 09:08:40 dspace tomcat9[876]: Jan 03, 2025 9:08:40 AM org.apache.coyote.http11.Http11Processor service
Jan 03 09:08:40 dspace tomcat9[876]: INFO: Error parsing HTTP request header
Jan 03 09:08:40 dspace tomcat9[876]: Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
Jan 03 09:08:40 dspace tomcat9[876]: java.lang.IllegalArgumentException: Request header is too large
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:778)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeader(Http11InputBuffer.java:892)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeaders(Http11InputBuffer.java:593)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:279)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:937)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1190)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at java.base/java.lang.Thread.run(Thread.java:840)
```

- The size of the query itself is 5362 bytes
- Increasing the `maxHttpHeaderSize` from the default of 8192 bytes to 16384 allows the search to complete successfully
- I notice that we had previously increased the `maxHttpHeaderSize` on the HTTP connector in Tomcat 7, but we no longer set it in Tomcat 9, so this is an overdue change
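- A 5362-byte query can still overflow an 8192-byte limit once it is percent-encoded into the request URL and the other headers are counted; a rough sketch of that arithmetic, assuming the query travels in the URL of a GET request and guessing a hypothetical 1024 bytes for the remaining headers:

```python
from urllib.parse import quote

TOMCAT_DEFAULT_MAX_HEADER = 8192  # bytes, the default mentioned above
OTHER_HEADERS_ESTIMATE = 1024  # hypothetical allowance for cookies, user agent, etc.


def encoded_length(query):
    """Length of the query after percent-encoding every reserved character."""
    return len(quote(query, safe=""))


def fits_in_header(query, limit=TOMCAT_DEFAULT_MAX_HEADER, overhead=OTHER_HEADERS_ESTIMATE):
    """Estimate whether the encoded query plus other headers stays under the limit."""
    return encoded_length(query) + overhead <= limit


# Quotes and spaces each grow from one byte to three when encoded
print(encoded_length('"a" OR "b"'))
```

- Boolean queries are full of quotes, spaces, and parentheses, so their encoded size can be much larger than their raw size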
<!-- vim: set sw=2 ts=2: -->
66
content/posts/cgcore-types-harmonization.md
Normal file
@@ -0,0 +1,66 @@
+++
title = "Harmonization of CG Core Output Types"
date = 2021-02-21T13:27:35+02:00
description = "Proposed changes to CG Core types after review of several CGIAR repositories."
categories = ["Notes"]
tags = ["Migration"]
url = "cgcore-types-harmonization"
draft = true

+++

Proposed changes to the CG Core controlled vocabulary for output types after review of actual usage by several CGIAR open access repositories.

With reference to [CG Core v2 draft standard](https://agriculturalsemantics.github.io/cg-core/cgcore.html) by Marie-Angélique as well as [DCMI DCTERMS](http://www.dublincore.org/specifications/dublin-core/dcmi-terms/).

<!--more-->

- [Proposed Changes](#proposed-changes)
- [Out of Scope](#out-of-scope)
- [Implementation Progress](#implementation-progress)

## Proposed Changes
As of 2021-01-18 the scope of the changes includes the following fields:

- cg.creator.id→cg.creator.identifier
  - ORCID identifiers
- dc.format.extent→dcterms.extent
- dc.date.issued→dcterms.issued
- dc.description.abstract→dcterms.abstract
- dc.description→dcterms.description
- dc.description.sponsorship→cg.contributor.donor
  - values from CrossRef or Grid.ac if possible
- dc.description.version→cg.reviewStatus
- cg.fulltextstatus→cg.howPublished
  - CGSpace uses values like "Formally Published" or "Grey Literature"
- dc.identifier.citation→dcterms.bibliographicCitation
- cg.identifier.status→dcterms.accessRights
  - current values are "Open Access" and "Limited Access"
  - future values are possibly "Open" and "Restricted"?
- dc.language.iso→dcterms.language
  - current values are ISO 639-1 (aka Alpha 2)
  - future values are possibly ISO 639-3 (aka Alpha 3)?
- cg.link.reference→dcterms.relation
- dc.publisher→dcterms.publisher
- dc.relation.ispartofseries will be split into:
  - series name: dcterms.isPartOf
  - series number: cg.number
- dc.rights→dcterms.license
  - Using [SPDX license identifiers](https://spdx.org/licenses/) if possible
- dc.source→cg.journal
- dc.subject→dcterms.subject
- dc.type→dcterms.type
- dc.identifier.isbn→cg.isbn
- dc.identifier.issn→cg.issn
- cg.targetaudience→dcterms.audience
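- The one-to-one renames in this list can be expressed as a lookup table; a minimal sketch that applies a few of them to metadata CSV column headers, assuming headers look like `dc.description.abstract[en_US]` (the table is deliberately partial, the names are hypothetical, and the `dc.relation.ispartofseries` split is not handled because it needs value-level changes):

```python
import re

# A subset of the one-to-one renames from the list above
FIELD_RENAMES = {
    "dc.format.extent": "dcterms.extent",
    "dc.date.issued": "dcterms.issued",
    "dc.description.abstract": "dcterms.abstract",
    "dc.description": "dcterms.description",
    "dc.description.sponsorship": "cg.contributor.donor",
    "dc.identifier.citation": "dcterms.bibliographicCitation",
    "dc.language.iso": "dcterms.language",
    "dc.rights": "dcterms.license",
    "dc.type": "dcterms.type",
}

# Field name followed by an optional language qualifier such as [en_US]
HEADER = re.compile(r"^(?P<field>[^\[\]]+)(?P<lang>\[[^\]]*\])?$")


def rename_header(header):
    """Map an old column header to its new field name, keeping the language."""
    match = HEADER.match(header)
    if not match:
        return header
    field = match.group("field")
    lang = match.group("lang") or ""
    # The whole field name is looked up, so dc.description and
    # dc.description.abstract cannot be confused with each other
    return FIELD_RENAMES.get(field, field) + lang


print(rename_header("dc.description.abstract[en_US]"))
```

- Fields that are not in the table, including everything that is out of scope, pass through unchanged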
### Out of Scope
The following fields are currently out of the scope of this migration because they are used internally by DSpace 5.x/6.x and would be difficult to change without significant modifications to the core of the code:

- dc.title (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)
- dc.title.alternative
- dc.date.available
- dc.date.accessioned
- dc.identifier.uri (hard coded for Handle assignment upon item submission)
- dc.description.provenance
- dc.contributor.author (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)
@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
|
||||
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
|
||||
78
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -242,15 +242,15 @@ db.statementpool = true
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
|
||||
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
|
||||
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -264,15 +264,15 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
|
||||
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
|
||||
Update GitHub wiki for documentation of maintenance tasks.
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -200,15 +200,15 @@ $ find SimpleArchiveForBio/ -iname “*.pdf” -exec basename {} ; | sor
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
|
||||
Not only are there 49,000 countries, we have some blanks (25)…
|
||||
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -378,15 +378,15 @@ Bitstream: tést señora alimentación.pdf
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
|
||||
For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
|
||||
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -316,15 +316,15 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -32,7 +32,7 @@ After running DSpace for over five years I’ve never needed to look in any
|
||||
This will save us a few gigs of backup space we’re paying for on S3
|
||||
Also, I noticed the checker log has some errors we should pay attention to:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -495,15 +495,15 @@ dspace.log.2016-04-27:7271
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
|
||||
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
|
||||
3168
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -371,15 +371,15 @@ sys 0m20.540s
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
|
||||
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
|
||||
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -409,15 +409,15 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
|
||||
|
||||
In this case the select query was showing 95 results before the update
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -325,15 +325,15 @@ discovery.index.authority.ignore-variants=true
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
|
||||
$ git reset --hard ilri/5_x-prod
|
||||
$ git rebase -i dspace-5.5
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -389,15 +389,15 @@ $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -34,7 +34,7 @@ It looks like we might be able to use OUs now, instead of DCs:
|
||||
|
||||
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -478,8 +478,8 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
|
||||
</code></pre><ul>
|
||||
<li>It actually works really well, and search results return much less hits now (before, after):</li>
|
||||
</ul>
|
||||
<p><img src="/cgspace-notes/2016/09/cgspace-search.png" alt="CGSpace search with &ldquo;OR&rdquo; boolean logic">
|
||||
<img src="/cgspace-notes/2016/09/dspacetest-search.png" alt="DSpace Test search with &ldquo;AND&rdquo; boolean logic"></p>
|
||||
<p><img src="/cgspace-notes/2016/09/cgspace-search.png" alt="CGSpace search with “OR” boolean logic">
|
||||
<img src="/cgspace-notes/2016/09/dspacetest-search.png" alt="DSpace Test search with “AND” boolean logic"></p>
|
||||
<ul>
|
||||
<li>Found a way to improve the configuration of Atmire’s Content and Usage Analysis (CUA) module for date fields</li>
|
||||
</ul>
|
||||
@@ -606,15 +606,15 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -42,7 +42,7 @@ I exported a random item’s metadata as CSV, deleted all columns except id
|
||||
|
||||
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -372,15 +372,15 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'h
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire’s Listings and Reports module
|
||||
Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
|
||||
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -548,15 +548,15 @@ org.dspace.discovery.SearchServiceException: Error executing query
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it’s not r
|
||||
I’ve raised a ticket with Atmire to ask
|
||||
Another worrying error from dspace.log is:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -668,7 +668,7 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
|
||||
<li>This is how DSpace works, and I need to ask if there is a way to override someone’s submission, as the other reviewer seems to not be paying attention, or has perhaps taken the item from the task pool?</li>
|
||||
<li>Run a batch edit to add “RANGELANDS” ILRI subject to all items containing the word “RANGELANDS” in their metadata for Peter Ballantyne</li>
|
||||
</ul>
|
||||
<p><img src="/cgspace-notes/2016/12/batch-edit1.png" alt="Select all items with &ldquo;rangelands&rdquo; in metadata">
|
||||
<p><img src="/cgspace-notes/2016/12/batch-edit1.png" alt="Select all items with “rangelands” in metadata">
|
||||
<img src="/cgspace-notes/2016/12/batch-edit2.png" alt="Add RANGELANDS ILRI subject"></p>
|
||||
<h2 id="2016-12-18">2016-12-18</h2>
|
||||
<ul>
|
||||
@@ -784,15 +784,15 @@ $ exit
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
|
||||
I tested on DSpace Test as well and it doesn’t work there either
|
||||
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -369,15 +369,15 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -50,7 +50,7 @@ DELETE 1
|
||||
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
|
||||
Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -423,15 +423,15 @@ COPY 1968
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing reg
|
||||
$ identify ~/Desktop/alc_contrastes_desafios.jpg
|
||||
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -355,15 +355,15 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
|
||||
|
||||
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -585,15 +585,15 @@ $ gem install compass -v 1.0.3
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -18,7 +18,7 @@
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="May, 2017"/>
|
||||
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSpace Test) tomorrow: https://cgspace."/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -391,15 +391,15 @@ UPDATE 187
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -18,7 +18,7 @@
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="June, 2017"/>
|
||||
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -270,15 +270,15 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
|
||||
Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
|
||||
We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -275,15 +275,15 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
|
||||
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
|
||||
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -517,15 +517,15 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
|
||||
|
||||
Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -659,15 +659,15 @@ Cert Status: good
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
|
||||
There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
|
||||
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -443,15 +443,15 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
|
||||
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
|
||||
COPY 54701
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -944,15 +944,15 @@ $ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sor
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -30,7 +30,7 @@ The logs say “Timeout waiting for idle object”
|
||||
PostgreSQL activity says there are 115 connections currently
|
||||
The list of connections to XMLUI and REST API for today:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -783,15 +783,15 @@ DELETE 20
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
|
||||
|
||||
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1452,15 +1452,15 @@ Catalina:type=Manager,context=/,host=localhost activeSessions 8
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -30,7 +30,7 @@ We don’t need to distinguish between internal and external works, so that
|
||||
Yesterday I figured out how to monitor DSpace sessions using JMX
|
||||
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1038,15 +1038,15 @@ UPDATE 3
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
|
||||
|
||||
Export a CSV of the IITA community metadata for Martin Mueller
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -585,15 +585,15 @@ Fixed 5 occurences of: GENEBANKS
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
|
||||
I tried to test something on DSpace Test but noticed that it’s down since god knows when
|
||||
Catalina logs at least show some memory errors yesterday:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -594,15 +594,15 @@ $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
|
||||
Then I reduced the JVM heap size from 6144 back to 5120m
|
||||
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -523,15 +523,15 @@ $ psql -h localhost -U postgres dspacetest
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -58,7 +58,7 @@ real 74m42.646s
|
||||
user 8m5.056s
|
||||
sys 2m7.289s
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -517,15 +517,15 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
|
||||
|
||||
There is insufficient memory for the Java Runtime Environment to continue.
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -569,15 +569,15 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
|
||||
The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
|
||||
I ran all system updates on DSpace Test and rebooted it
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -442,15 +442,15 @@ $ dspace database migrate ignored
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -30,7 +30,7 @@ I’ll update the DSpace role in our Ansible infrastructure playbooks and ru
|
||||
Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
|
||||
I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -748,15 +748,15 @@ UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND me
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
|
||||
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
|
||||
I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -656,15 +656,15 @@ $ curl -X GET -H "Content-Type: application/json" -H "Accept: applic
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
|
||||
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
|
||||
Today these are the top 10 IPs:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -553,15 +553,15 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
|
||||
|
||||
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -594,15 +594,15 @@ UPDATE 1
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -50,7 +50,7 @@ I don’t see anything interesting in the web server logs around that time t
|
||||
357 207.46.13.1
|
||||
903 54.70.40.11
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -952,6 +952,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&am
|
||||
<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ILRI?src=hash&ref_src=twsrc%5Etfw">#ILRI</a> research: Towards unlocking the potential of the hides and skins value chain in Somaliland <a href="https://t.co/EZH7ALW4dp">https://t.co/EZH7ALW4dp</a></p>— ILRI.org (@ILRI) <a href="https://twitter.com/ILRI/status/1086330519904673793?ref_src=twsrc%5Etfw">January 18, 2019</a></blockquote>
|
||||
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
|
||||
|
||||
|
||||
<ul>
|
||||
<li>The shortened link is <a href="goo.gl/fb/VRj9Gq">goo.gl/fb/VRj9Gq</a> and it shows a “Dynamic Link not found” error from Firebase:</li>
|
||||
</ul>
|
||||
@@ -1264,15 +1265,15 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInter
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -72,7 +72,7 @@ real 0m19.873s
|
||||
user 0m22.203s
|
||||
sys 0m1.979s
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1344,15 +1344,15 @@ Please see the DSpace documentation for assistance.
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
|
||||
|
||||
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1208,15 +1208,15 @@ sys 0m2.551s
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
|
||||
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
|
||||
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1299,15 +1299,15 @@ UPDATE 14
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -48,7 +48,7 @@ DELETE 1
|
||||
|
||||
But after this I tried to delete the item from the XMLUI and it is still present…
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -631,15 +631,15 @@ COPY 64871
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
|
||||
|
||||
Skype with Marie-Angélique and Abenet about CG Core v2
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -317,15 +317,15 @@ UPDATE 2
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -21,7 +21,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
|
||||
<meta property="og:type" content="article" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-07/" />
|
||||
<meta property="article:published_time" content="2019-07-01T12:13:51+03:00" />
|
||||
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
|
||||
<meta property="article:modified_time" content="2023-08-14T10:39:08+02:00" />
|
||||
|
||||
|
||||
|
||||
@@ -38,7 +38,7 @@ CGSpace
|
||||
|
||||
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -50,7 +50,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
|
||||
"url": "https://alanorth.github.io/cgspace-notes/2019-07/",
|
||||
"wordCount": "2330",
|
||||
"datePublished": "2019-07-01T12:13:51+03:00",
|
||||
"dateModified": "2019-10-28T13:39:25+02:00",
|
||||
"dateModified": "2023-08-14T10:39:08+02:00",
|
||||
"author": {
|
||||
"@type": "Person",
|
||||
"name": "Alan Orth"
|
||||
@@ -330,7 +330,7 @@ dc.identifier.issn
|
||||
<li>Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: “Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.”</li>
|
||||
<li>Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: “Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.”</li>
|
||||
<li>I looked in the DSpace logs and found this right around the time of the screenshot he sent me:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
|
||||
@@ -554,15 +554,15 @@ issn.validate('1020-3362')
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded… wow, that’s luck
|
||||
|
||||
Run system updates on DSpace Test (linode19) and reboot it
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -573,15 +573,15 @@ sys 2m27.496s
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
|
||||
7249 2a01:7e00::f03c:91ff:fe18:7396
|
||||
9124 45.5.186.2
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -581,15 +581,15 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -18,7 +18,7 @@
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="October, 2019"/>
|
||||
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc."/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -385,15 +385,15 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -58,7 +58,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
106781
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -692,15 +692,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
|
||||
# dpkg -C
|
||||
# reboot
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -404,15 +404,15 @@ UPDATE 1
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
|
||||
|
||||
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -604,15 +604,15 @@ COPY 2900
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
|
||||
|
||||
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1275,15 +1275,15 @@ Moving: 21993 into core statistics-2019
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
|
||||
|
||||
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -484,15 +484,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week
|
||||
|
||||
On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -658,15 +658,15 @@ $ psql -c 'select * from pg_stat_activity' | wc -l
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
|
||||
|
||||
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -477,15 +477,15 @@ Caused by: java.lang.NullPointerException
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
|
||||
In other news, I checked the statistics API on DSpace 6 and it’s working
|
||||
I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -811,15 +811,15 @@ $ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]&#
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
|
||||
|
||||
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -1142,15 +1142,15 @@ Fixed 4 occurences of: Muloi, D.M.
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
|
||||
|
||||
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -798,15 +798,15 @@ $ grep -c added /tmp/2020-08-27-countrycodetagger.log
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
|
||||
|
||||
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.111.3">
|
||||
<meta name="generator" content="Hugo 0.133.1">
|
||||
|
||||
|
||||
|
||||
@@ -717,15 +717,15 @@ solr_query_params = {
|
||||
<ol class="list-unstyled">
|
||||
|
||||
|
||||
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
|
||||
|
||||
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
|
||||
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
|
||||
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
@@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -1241,15 +1241,15 @@ $ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -32,7 +32,7 @@ So far we’ve spent at least fifty hours to process the statistics and stat
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -731,15 +731,15 @@ $ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspa
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -869,15 +869,15 @@ $ query-json '.items | length' /tmp/policy2.json
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -688,15 +688,15 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -60,7 +60,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
 }
 }
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -898,15 +898,15 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -875,15 +875,15 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -1042,15 +1042,15 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
 I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -685,15 +685,15 @@ May 26, 02:57 UTC
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -36,7 +36,7 @@ I simply started it and AReS was running again:
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -693,15 +693,15 @@ I simply started it and AReS was running again:
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
 localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
 COPY 20994
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -715,15 +715,15 @@ COPY 20994
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
 I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -606,15 +606,15 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -588,15 +588,15 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -46,7 +46,7 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
 So we have 1879/7100 (26.46%) matching already
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -791,15 +791,15 @@ Try doing it in two imports. In first import, remove all authors. In second impo
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -32,7 +32,7 @@ First I exported all the 2019 stats from CGSpace:
 $ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
 $ zstd statistics-2019.json
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -494,15 +494,15 @@ $ zstd statistics-2019.json
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -40,7 +40,7 @@ Purging 455 hits from WhatsApp in statistics
 Total number of bot hits purged: 3679
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -577,15 +577,15 @@ Total number of bot hits purged: 3679
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
@@ -24,7 +24,7 @@ Start a full harvest on AReS
 Start a full harvest on AReS
 "/>
-<meta name="generator" content="Hugo 0.111.3">
+<meta name="generator" content="Hugo 0.133.1">
@@ -380,15 +380,15 @@ Start a full harvest on AReS
 <ol class="list-unstyled">
-<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
+<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
-<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
+<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
-<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
+<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
-<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
+<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
-<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
+<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
 </ol>
 </section>
Some files were not shown because too many files have changed in this diff.