Compare commits

277 Commits

Author SHA1 Message Date
63a2dcfdee Add notes for 2025-01-03 2025-01-03 12:37:39 +03:00
e7d7d4af89 Add notes 2024-12-04 16:27:49 +03:00
bd2d9779bb Add notes 2024-11-19 10:40:23 +03:00
47b96e8370 Add notes for 2024-10-08 2024-10-08 13:46:23 +03:00
512848fc73 Add notes for 2024-10-03 2024-10-03 11:51:44 +03:00
f8a1876ad2 Add notes for 2024-09-29 2024-09-30 07:56:53 +03:00
bb1367025a Add notes for 2024-09-23 2024-09-23 13:10:20 +03:00
dabbc20806 Update notes for 2024-09-16 2024-09-17 08:11:03 +04:00
edd2a8b306 Add docs again 2024-09-17 08:02:34 +04:00
842373d26f Update themes/hugo-theme-bootstrap4-blog 2024-09-17 08:01:55 +04:00
35342f95dc Add notes for 2024-09-16 2024-09-16 22:52:51 +04:00
79708bd30c Add notes for 2024-09-14 2024-09-14 23:02:16 +03:00
a5298945a3 Add notes for 2024-09 2024-09-09 10:20:09 +03:00
062019463c Add docs 2024-08-28 11:35:14 +03:00
f1c25111d0 Add notes 2024-08-28 11:35:05 +03:00
da6d73bc1f content/post/2024-07.md: fix spaces 2024-08-22 09:51:08 +03:00
7be53639dc Add content/posts/2024-08.md 2024-08-16 19:57:30 -07:00
64b8957945 Update notes 2024-08-07 08:54:13 -07:00
89d1b61442 Update notes for 2024-07-11 2024-07-11 13:08:22 +03:00
668947909a Add notes 2024-07-02 11:12:03 +03:00
7858008918 Add notes for 2024-06-21 2024-06-23 09:34:49 +03:00
c3436ea6c2 Add notes for 2024-06-18 2024-06-18 17:30:08 +03:00
bf4a6402d7 Add notes 2024-06-16 16:40:54 +03:00
8383cd466b Add notes for 2024-06-03 2024-06-03 17:31:03 +03:00
6d574d645d Add notes for 2024-05-28 2024-05-28 16:40:32 +03:00
befe3a3a58 Add notes for 2024-05-27 2024-05-27 21:40:09 +03:00
39d8d0876c Add notes for 2024-05-20 2024-05-20 17:34:14 +03:00
28a0c82e96 Minor syntax fix in example 2024-05-16 08:27:56 +03:00
7fc97884df Add notes for 2024-05-13 2024-05-13 16:24:11 +03:00
223453adbb Add notes for 2023-05-13 2024-05-13 08:21:17 +03:00
1b523bf055 Add notes for 2024-05-05 2024-05-05 21:43:52 +03:00
908a75a5c7 Add notes for 2024-05-01 2024-05-01 17:10:05 +03:00
e323c15e8b Add notes for 2024-04-29 2024-04-29 17:21:28 +03:00
8f156a0365 Add notes 2024-04-27 11:22:58 +03:00
515cc0650f Add notes 2024-04-25 15:28:35 +03:00
6db3da2739 Add notes 2024-04-18 17:00:25 +03:00
60b244486f Add notes 2024-04-18 09:38:02 +03:00
efd8eb7f79 Add notes 2024-04-16 09:35:30 +03:00
281827944a Add notes for 2024-04-12 2024-04-12 20:40:52 +03:00
864b3b136e Add notes 2024-04-09 16:50:56 +03:00
01a2ff5bfd Add notes 2024-04-04 10:23:49 +03:00
d71c430a7d Add notes 2024-03-25 18:53:18 +03:00
0e43fc97d7 Add notes for 2024-03-19 2024-03-19 16:24:20 +03:00
90c4d46607 Add notes 2024-03-19 09:01:13 +03:00
83c053f7ee Add notes for 2024-03-13 2024-03-14 09:29:05 +03:00
ba68787282 Update notes for 2024-03-11 2024-03-11 21:58:15 +03:00
1fc45e8f1b Add notes for 2024-03-11 2024-03-11 18:04:40 +03:00
11f1935f85 Add notes for 2024-03-08 2024-03-08 17:31:19 +03:00
5ff70af33b Add notes for 2024-03 2024-03-04 10:02:14 +03:00
b60a58f56a Fix date for 2024-02 frontmatter 2024-03-01 09:55:02 +03:00
cc28c0ccdc Add notes for 2024-02-29 2024-02-29 16:38:38 +03:00
1e87242956 Add notes for 2024-02-29 2024-02-29 09:41:44 +03:00
483a170f06 Add notes 2024-02-27 17:18:35 +03:00
0692b8666c Add notes for 2024-02-23 2024-02-24 20:44:15 +03:00
b2eaff29b1 Add notes for 2024-02-20 2024-02-20 22:55:09 +03:00
da0fd61b7e Add notes for 2024-02-19 2024-02-19 16:48:20 +03:00
3f4b66bd08 Add notes for 2024-02 2024-02-06 11:45:02 +03:00
ed290fb6f8 Add notes for 2024-01-29 2024-02-05 11:09:40 +03:00
63c20dbef9 Add notes for 2024-01-27 2024-01-28 09:23:40 +03:00
300b2e4271 Notes for 2024-01-23 2024-01-24 08:24:50 +03:00
57fe0587a4 Add notes 2024-01-18 15:59:49 +03:00
20ace46614 Add notes 2024-01-10 17:21:12 +03:00
3475d4fd5d Add notes for 2024-01-10 2024-01-10 08:34:16 +03:00
1dfb54ef6b Update notes for 2024-01-07 2024-01-07 22:18:43 +03:00
82c79fc257 Add notes for 2024-01-07 2024-01-07 20:43:02 +03:00
cf5c1e2155 Add notes for 2024-01-06 2024-01-06 17:46:07 +03:00
7418dae4b9 Add notes 2024-01-05 15:45:46 +03:00
264cdcf1db Add notes 2023-12-29 12:08:57 +03:00
293b500b26 content/posts/2023-07.md: minor grammar fix 2023-12-27 10:48:32 +03:00
17a241de5b Add notes for 2023-12-20 2023-12-21 10:09:15 +03:00
7695eacf7a Add notes 2023-12-18 23:15:27 +03:00
f4c985c16b Add notes for 2023-12-12 2023-12-12 14:57:07 +03:00
bc6412de09 Add notes for 2023-12-08 2023-12-09 09:55:16 +03:00
2ecafafc17 Notes for 2023-12-08 2023-12-08 16:32:48 +03:00
804a505ae2 docs: regenerate 2023-12-06 20:57:19 +03:00
6c5fa7375f Fix notes for 2023-11 2023-12-06 20:57:07 +03:00
f2bee38014 Add notes for 2023-12-05 2023-12-06 09:55:57 +03:00
a50fe66c78 Add notes 2023-12-02 10:38:09 +03:00
177c3b796d Add notes for 2023-11-23 2023-11-23 16:15:13 +03:00
eb218389a0 Add notes for 2023-11-18 2023-11-19 14:29:52 +03:00
1dd5900fbf Add notes for 2023-11-16 2023-11-16 17:25:15 +03:00
d14dd7114a Add notes for 2023-11-11 2023-11-13 16:54:36 +03:00
01fb17950b Add notes 2023-11-08 08:20:31 +03:00
c6d514bef9 Add notes for 2023-11-02 2023-11-02 20:58:43 +03:00
34523acc47 Add notes for 2023-10-27 2023-10-27 17:09:30 +03:00
3a4ecbd82d Add notes 2023-10-24 23:26:01 +03:00
c9bcfca903 Add notes for 2023-10-16 2023-10-16 17:03:59 +03:00
7e3a7951d6 Add notes for 2023-10-13 2023-10-13 17:17:41 +03:00
8d39fc7d71 Fix typo 2023-10-08 22:04:41 +03:00
22dd379e9a Add notes for 2023-10-07 2023-10-08 10:57:53 +03:00
98cdd21cb5 Add notes for 2023-10-06 2023-10-06 15:19:34 +03:00
62838a091c Add notes for 2023-10-05 2023-10-05 17:58:03 +03:00
cb40610726 Update notes 2023-10-04 09:24:33 +03:00
249d9be387 Update notes 2023-09-30 13:07:23 +03:00
4a02a78186 Add notes for 2023-09-25 2023-09-25 17:38:05 +03:00
aa6cbb488d Add notes for 2023-09-22 2023-09-23 10:15:01 +03:00
aeaa397612 Add notes for 2023-09-19 2023-09-19 21:13:52 +03:00
d60b85433d Update notes for 2023-09-16 2023-09-16 23:38:04 +03:00
202d3fb88f Add notes for 2023-09-16 2023-09-16 20:24:24 +03:00
afcbc67874 Add notes for 2023-09-13 2023-09-14 20:57:25 +03:00
22e47beeb6 Add notes for 2023-09-10 2023-09-11 09:18:52 +03:00
223979f267 Add notes for 2023-09-09 2023-09-10 09:58:29 +03:00
28d62f1c0c Update notes for 2023-09-08 2023-09-09 00:25:48 +03:00
34bf124d5d Add notes for 2023-09-08 2023-09-09 00:25:12 +03:00
011a1ec9db Add notes for 2023-09-03 2023-09-04 09:16:51 +03:00
45781d590d Add notes for 2023-09-02 2023-09-02 17:37:15 +03:00
d8e0004240 Add notes 2023-09-01 08:10:02 +03:00
bfb7da50af Add notes for 2023-08-31 2023-08-31 17:36:25 +03:00
6ec5e4b006 Add notes 2023-08-30 19:16:01 +03:00
1529cfd80b Add notes 2023-08-29 21:38:23 +03:00
6737febf95 Add notes for 2023-08-26 2023-08-26 19:27:57 +03:00
6fbcc342d2 Add notes for 2023-08-25 2023-08-25 17:06:19 +03:00
e83e681706 Add notes for 2023-08-24 2023-08-24 21:58:03 +03:00
33061dbe3a Add notes for 2023-08-23 2023-08-24 09:03:46 +03:00
d2ad21bde1 Add notes for 2023-08-22 2023-08-22 17:28:49 +03:00
f38ecfb75e Add notes for 2023-08-18 2023-08-18 23:54:07 +03:00
24dd6fefb5 Add notes for 2023-08-14 2023-08-14 18:38:03 +02:00
a659eef05f Fix name 2023-08-14 10:39:08 +02:00
9944f61ed5 Add notes for 2023-08-12 2023-08-13 05:54:16 +02:00
87ccbfc0f0 Add notes for 2023-08-11 2023-08-11 12:25:50 +02:00
929ce9685a Add notes for 2023-08-08 2023-08-08 12:54:39 +02:00
e0f9e484ee Add notes for 2023-08-07 2023-08-07 10:48:56 +02:00
021a92c0d9 Add notes for 2023-08-05 2023-08-05 17:27:43 +03:00
c97d005aa4 Add notes for 2023-08-04 2023-08-04 18:05:44 +03:00
190a1ee4a3 Add notes for 2023-07-31 2023-08-02 23:04:11 +03:00
9a2de13f21 Add notes for 2023-07-28 2023-07-28 12:18:39 +03:00
c644f40491 Add notes 2023-07-28 11:59:59 +03:00
6e701ee9c2 Add notes for 2023-07-25 2023-07-25 23:54:53 +03:00
e4dc8a3ed0 Add notes for 2023-07-22 2023-07-22 09:19:48 +03:00
74f4afe72a Add notes for 2023-07-20 2023-07-20 16:02:38 +03:00
8bebf47078 Add some days of notes 2023-07-19 12:27:43 +03:00
8c1e898683 Add notes for 2023-07-08 2023-07-08 23:20:53 +03:00
89d3fb717c Add notes for 2023-07-05 2023-07-05 16:36:30 +03:00
309ffad285 Add notes for 2023-07-03 2023-07-04 08:03:36 +03:00
0fab2a0f28 Add notes 2023-07-01 17:17:31 +03:00
ae41ef3682 Add notes for 2023-06-28 2023-06-28 20:11:34 +03:00
4415eec1a0 Add notes for 2023-06-19 2023-06-19 16:26:41 +03:00
6985b53a7b Add notes for 2023-06-17 2023-06-17 23:14:32 +03:00
df88592009 Add notes for 2023-06-14 2023-06-14 20:29:35 +03:00
3a68bc3cc7 Add notes for 2023-06-13 2023-06-13 20:58:57 +03:00
943fa8f1a2 Add notes for 2023-06-09 2023-06-10 09:17:08 +03:00
363dbb4505 Add notes for 2023-06-08 2023-06-08 17:04:20 +03:00
bda3cb4cd1 Add notes for 2023-06-06 2023-06-06 16:54:25 +03:00
33c42ecd49 Add notes for 2023-06-04 2023-06-04 11:00:30 +03:00
a9dc98b2dd Add notes for 2023-06-02 2023-06-02 16:33:48 +03:00
0b0d2ea87d Add notes 2023-06-02 08:53:06 +03:00
825385562d Add notes for 2023-05-30 2023-05-30 20:19:17 +03:00
416d2bc7a7 Add notes for 2023-05-26 2023-05-26 17:04:18 +03:00
7cde2ad26b Add notes for 2023-05-22 2023-05-23 08:49:01 +03:00
5fbc484c80 Add notes 2023-05-20 11:10:05 +03:00
aa5fab70b7 Add notes for 2023-05-18 2023-05-18 16:47:51 +03:00
d8be9c001c Add notes for 2023-05-12 2023-05-12 14:02:55 +03:00
a4a725f22e Add notes for 2023-05-11 2023-05-12 08:33:20 +03:00
572f4639ac Add notes for 2023-05-04 2023-05-04 17:27:29 +03:00
b4a5ec05e7 content/posts/2023-04.md: update image format scores (After re-calculation with ssimulacra2 v2.1.) 2023-05-04 14:44:51 +03:00
e1aa40cf0e Add notes 2023-05-04 08:38:27 +03:00
bd36e93cd9 Update notes 2023-05-03 17:10:37 +03:00
820114f464 Add notes 2023-05-02 10:39:34 +03:00
ad8516bbb3 Add notes for 2023-04-27 2023-04-27 13:10:13 -07:00
0ca3cadbef Add notes for 2023-04-22 2023-04-22 16:37:19 -07:00
c20f1e1f89 Add notes for 2023-04-20 2023-04-20 22:44:18 -07:00
b024eb1f94 Add notes for 2023-04-18 2023-04-18 11:08:15 -07:00
85438953ce Add notes for 2023-04-06 2023-04-06 16:13:30 +03:00
5a0b3aaec1 Add notes for 2023-04-02 2023-04-02 09:16:25 +03:00
a2875a3811 Add notes for 2023-03-30 2023-03-30 16:59:20 +03:00
479bb9684a Update notes for 2023-03-28 2023-03-28 23:38:38 +03:00
5cd298a37a Add notes for 2023-03-28 2023-03-28 17:04:54 +03:00
37bdf2645f Add notes for 2023-03-27 2023-03-27 10:03:45 +03:00
11646971a9 Add notes for 2023-03-24 2023-03-24 13:19:13 +03:00
534f0d9cf8 Add notes for 2023-03-21 2023-03-22 08:28:33 +03:00
66a1f54e3a Add notes for 2023-03-21 2023-03-21 16:35:41 +03:00
cfdd1cb7fa Add notes for 2023-03-19 2023-03-19 19:48:06 +03:00
e926834065 Add notes for 2023-03-18 2023-03-18 17:42:40 +03:00
68b378845a Add notes for 2023-03-15 2023-03-15 08:03:48 +03:00
e9dd768d66 content/posts/2023-01.md: fix typos 2023-03-14 14:30:17 +03:00
40fe625083 Add notes 2023-03-13 21:22:25 +03:00
345cd4365b Add notes for 2023-03-10 2023-03-10 17:34:05 +03:00
bee6532af2 Add notes for 2023-03-09 2023-03-09 17:01:50 +03:00
5787bc326c Add notes for 2023-03-08 2023-03-08 18:53:32 +03:00
f5d24aa841 Add notes for 2023-03-07 2023-03-07 17:15:26 +03:00
2b98b5cda7 Add notes for 2023-03-07 2023-03-07 10:05:12 +03:00
19f8de4481 Update notes 2023-03-07 09:53:31 +03:00
7a48286d6b Add notes 2023-03-01 08:30:25 +03:00
e06160976c Add notes 2023-02-26 19:59:12 +03:00
2e80702de4 Add notes for 2023-02-22 2023-02-22 21:37:12 +03:00
ba6f826201 content/posts/2022-08.md: syntax fix 2023-02-22 11:59:48 +03:00
47f2c6c17f Add notes for 2023-02-21 2023-02-21 20:46:53 +03:00
a667e6986e Add notes for 2023-02-15 2023-02-15 19:47:13 +03:00
617c0eec3c Add notes for 2023-02-14 2023-02-14 23:13:35 +03:00
0b64999280 Add notes for 2023-02-12 2023-02-13 10:33:39 +03:00
d5214f02e1 Add notes for 2023-02-08 2023-02-09 08:50:54 +03:00
16ba5723eb Add notes for 2023-01-31 2023-01-31 22:20:38 +03:00
81f04f48ad Add notes for 2023-01-29 2023-01-29 18:19:31 +03:00
2c7f6b3e39 Add notes 2023-01-22 21:53:45 +03:00
ddb1ce8f4e Add notes for 2023-01-17 2023-01-17 22:38:55 +03:00
3f4e42fe37 Add notes for 2023-01-15 2023-01-15 08:10:16 +03:00
db4b0a6fd6 Add notes for 2023-01-12 2023-01-12 23:11:42 +03:00
967b16a966 Add notes for 2023-01-10 2023-01-10 22:22:03 +03:00
d1278a67d8 Add notes for 2023-01-04 2023-01-04 17:08:14 +03:00
676eefafbb content/posts/2022-11.md: Fix syntax for image 2023-01-04 10:53:02 +03:00
b781203a58 Add notes for 2023-01-01 2023-01-01 10:12:13 +02:00
9768a0fe57 Add notes for 2022-12-29 2022-12-29 08:32:08 +02:00
2e6c267397 Add notes for 2022-12-28 2022-12-28 22:55:34 +02:00
bf122d4ac3 Add notes for 2022-12-25 2022-12-25 16:48:19 +02:00
249a63404b Add notes for 2022-12-23 2022-12-23 10:04:37 +02:00
3be39e67fa Add notes for 2022-12-21 2022-12-21 20:39:09 +02:00
8354acdbdd Add notes for 2022-12-18 2022-12-19 07:03:13 +02:00
54769fcb04 Add notes for 2022-12-15 2022-12-15 16:41:04 +03:00
aaec17b94d Add notes for 2022-12-14 2022-12-14 22:14:03 +03:00
9c1e60426a Add notes for 2022-12-12 2022-12-12 18:17:33 +03:00
1bafe6ce71 Add notes for 2022-12-08 2022-12-08 18:59:57 +02:00
4200ae4189 Add notes for 2022-12-07 2022-12-07 22:59:37 +01:00
12b4f1660d Add notes for 2022-12-03 2022-12-04 03:19:49 +03:00
1dd80f769a Add notes for 2022-12-02 2022-12-03 10:46:29 +03:00
651148cf0a Add notes for 2022-11-30 2022-11-30 18:21:20 +03:00
0599df9bed Add notes for 2022-11-30 2022-11-30 12:35:31 +03:00
4f254af2f3 Update notes for 2022-11-28 2022-11-28 23:19:19 +03:00
8199de67ad Add notes for 2022-11-28 2022-11-28 17:42:46 +03:00
f5750dab39 Add notes for 2022-11-27 2022-11-27 13:52:43 +03:00
6240bdf5ad content/posts/2022-11.md: fix typo 2022-11-27 12:38:48 +03:00
59cd155eb3 Add notes for 2022-11-26 2022-11-26 17:38:27 +03:00
b5b28f2d78 Add notes for 2022-11-24 2022-11-24 17:41:34 +03:00
b9d764d026 Add notes for 2022-11-23 2022-11-23 17:10:47 +03:00
de6172b45a Add notes for 2022-11-21 2022-11-21 10:31:02 +03:00
4e6a8ec51b Add notes for 2022-11-09 2022-11-10 15:45:04 +03:00
c63abf656d Add notes for 2022-11-07 2022-11-07 17:18:14 +03:00
7544ee54ea Add notes for 2022-11-01 2022-11-01 22:12:24 +03:00
d48d74c981 Add notes for 2022-10-31 2022-10-31 16:59:47 +03:00
5ae92a2334 Add notes for 2022-10-30 2022-10-31 07:48:00 +03:00
3633377854 Add notes for 2022-10-28 2022-10-28 13:17:35 +03:00
189f33e1ce Add notes for 2022-10-26 2022-10-26 17:50:40 +03:00
5da2c1eff7 Add notes for 2022-10-25 2022-10-26 09:15:29 +03:00
3e8da69de7 Add notes for 2022-10-25 2022-10-25 16:38:17 +03:00
3f0d06239b Add notes for 2022-10-22 2022-10-23 12:33:23 +03:00
46a9178bdb Add notes for 2022-10-19 2022-10-19 21:32:01 +03:00
7713ecefa8 Add notes for 2022-10-18 2022-10-18 22:12:42 +03:00
a1ddc29951 Add notes for 2022-10-17 2022-10-17 15:58:02 +03:00
96cdb781fb Add notes 2022-10-15 17:38:47 +03:00
55a231611f Add notes for 2022-10-12 2022-10-13 07:10:59 +03:00
57288fad56 Add notes for 2022-10-09 2022-10-09 21:19:38 +03:00
510dd965ea Add notes 2022-10-07 21:29:35 +03:00
42f0fc6147 Add notes for 2022-10-05 2022-10-05 17:22:42 +03:00
9a88b6c1b5 Add notes for 2022-10-03 2022-10-03 16:26:30 +03:00
652f181273 Add notes for 2022-10-01 2022-10-01 19:47:37 +03:00
c7aec5606c Add notes for 2022-09-30 2022-09-30 17:29:50 +03:00
96f47ec7b5 Update notes for 2022-09-28 2022-09-28 21:23:10 +03:00
f1bb112554 Add notes for 2022-09-29 2022-09-28 17:10:23 +03:00
a2ca9483c4 content/posts/2022-08.md: add update to issue 2022-09-27 14:35:26 +03:00
98a3695d0d Add notes for 2022-09-26 2022-09-26 17:17:19 +03:00
a156315103 Update notes for 2022-09-25 2022-09-25 21:02:46 +03:00
ecb09f0a54 Add notes for 2022-09-25 2022-09-25 14:32:38 +03:00
062450e84f Add notes for 2022-09-24 2022-09-24 09:26:29 +03:00
c9e2325f34 Add notes for 2022-09-23 2022-09-23 16:49:58 +03:00
ae01de27c5 Add notes for 2022-09-22 2022-09-22 21:59:15 +03:00
fbf08b7003 Add notes for 2022-09-19 2022-09-19 15:58:41 +03:00
3b78d2f7e4 Add notes for 2022-09-18 2022-09-18 21:04:01 +03:00
1b15837e4e Add notes for 2022-09-16 2022-09-16 17:09:32 +03:00
e0d4d1ff7f Add notes for 2022-09-14 2022-09-15 08:37:57 +03:00
954f3598bd content/posts/2022-09.md: fix date 2022-09-15 08:37:36 +03:00
547a92723d Add notes for 2022-09-12 2022-09-12 17:07:29 +03:00
147ad86375 Add notes for 2022-09-12 2022-09-12 11:35:57 +03:00
69392070de Add notes for 2022-09-09 2022-09-09 17:29:51 +03:00
aa77e80c44 Add notes for 2022-09-08 2022-09-08 17:47:25 +03:00
ef3b4f1176 Add notes for 2022-09-07 2022-09-07 18:00:26 +03:00
5972b89839 Add notes for 2022-09-06 2022-09-06 17:48:46 +03:00
ac66d6c1a9 Add notes for 2022-09-05 2022-09-05 16:59:11 +03:00
6ce43e6a95 Add notes for 2022-09-01 and 2022-09-02 2022-09-02 16:41:19 +03:00
baf1cea539 Add notes 2022-08-31 17:37:28 +03:00
d9e2669a3d Add notes for 2022-08-30 2022-08-30 17:45:35 +03:00
49af872267 Add notes for 2022-08-29 2022-08-29 04:54:12 +03:00
5084b5ca5e Add notes for 2022-08-24 2022-08-24 21:24:07 -07:00
64d5b998f9 Add notes for 2022-08-23 2022-08-23 12:14:14 -07:00
8e6c83a5e1 Add notes for 2022-08-20 2022-08-20 22:37:35 -07:00
daf4a646ed Add notes for 2022-08-19 2022-08-19 21:55:36 -07:00
fc0a9ad944 Update notes for 2022-08-18 2022-08-18 22:43:37 -07:00
e203ee6dcc Add notes for 2022-08-18 2022-08-18 13:45:48 -07:00
6c61d1c102 Add notes for 2022-08-15 2022-08-15 18:46:57 -07:00
234 changed files with 32611 additions and 8213 deletions


@@ -209,7 +209,7 @@ dc.identifier.issn
- I need to follow up with Moayad about the reporting functionality
- Also, I need to email Harrison my notes on the CG Core v2 stuff
- Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going
- Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: "Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission."
- Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: "Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission."
- I looked in the DSpace logs and found this right around the time of the screenshot he sent me:
```


@@ -169,7 +169,7 @@ $ csvjoin --outer -c alpha2 ~/Downloads/clarisa-countries.csv ~/Downloads/UNSD\
- Then re-export the UN M.49 countries to a clean list because the one I did yesterday somehow has errors:
```console
csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ \ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
$ csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ \ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
```
- Check the number of lines in each file:
@@ -202,7 +202,7 @@ $ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-co
## 2022-06-28
- Start working on the CGSpace subject export for FAO
- Start working on the CGSpace subject export for FAO / AGROVOC
- First I exported a list of all metadata in our `dcterms.subject` and other center-specific subject fields with their counts:
```console
@@ -220,7 +220,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-
- I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!
- I think I will have to write some custom script to use the AGROVOC RDF file
- Using rdflib to open the 1.2GB `agrovoc_lod.rdf` file takes several minutes and doesn't seem very efficient
- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limiting and I'm not sure how to search yet
- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limited and I'm not sure how to search yet
- I had to try in different Python versions because 3.10.x is apparently too new
- For future reference I was able to search with lightrdf:


@@ -56,7 +56,7 @@ $ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Do
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
$ dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
```
- Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
@@ -72,6 +72,7 @@ $ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveForma
![Controlled vocabulary bug in DSpace 7](/cgspace-notes/2022/08/dspace7-submission.png)
- I think we need to add IDs, I will have to check what the implications of that are
- Note (2022-09-27): see this related change from DSpace 7.3: https://github.com/DSpace/DSpace/pull/8174
- Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent: `AICCRA website harvester`
- I verified that I see it in the REST API logs, but I don't see any new stats hits for it
- I do see 11,000 hits from that IP last month when I had the incorrect nginx configuration that was sending a literal `$http_user_agent` so I purged those
@@ -90,4 +91,257 @@ $ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveForma
- I will also purge all the hits from this IP in Solr statistics
- I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat's Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions
## 2022-08-15
- Start indexing on AReS
- Add CONSERVATION to ILRI subjects on CGSpace
- I see that AGROVOC has `conservation agriculture` and I suggested that we use that instead
## 2022-08-17
- Peter and Jose sent more feedback about the CRP Innovation records from MARLO
- We expanded the CRP names in the citation and removed the `cg.identifier.url` URLs because they are ugly and will stop working eventually
- The mappings of MARLO links will be done internally with the `cg.number` IDs like "IN-1119" and the Handle URIs
## 2022-08-18
- I talked to Jose about the CCAFS MARLO records
- He still hasn't finished re-processing the PDFs to update the internal MARLO links
- I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose
- On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn't use the correct quoting when importing)
- Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
- Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:
```console
$ csvcut -c 'cg.number (series/report No.)',File ~/Downloads/MELIA-Metadata-v2-csv.csv > MELIA-v2-IDs-Files.csv
$ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM.csv MELIA-v2-IDs-Files.csv > MELIAs-UTF-8-with-files.csv
```
- Then I imported them into OpenRefine to start metadata cleaning and enrichment
- Make some minor changes to [cgspace-submission-guidelines](https://github.com/ilri/cgspace-submission-guidelines)
- Upgrade to Bootstrap v5.2.0
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)
## 2022-08-19
- Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
- I spent half an hour in OpenRefine to fix the dates because they only had YYYY, but most abstracts and titles had more specific information about the date
- Then I checked for duplicates:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
```
- I sent the list of ~130 possible duplicates to Peter to check
- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
- The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
- I asked them why they don't just use the original links in the first place in case tinyurl.com disappears
- I continued working on the MARLO MELIA v2 UTF-8 metadata
- I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
- It helps to replace some characters with spaces first with this GREL: `value.replace(/[.\/;(),]/, " ")`
- This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
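- For reference, a rough standalone Python equivalent of the GREL replacement above, as a sketch only (the function name `strip_punctuation` is made up for illustration and is not part of OpenRefine or my scripts):
```python
import re


def strip_punctuation(value):
    """Replace the same characters as the GREL expression with spaces."""
    # Same character class as the GREL regex: . / ; ( ) ,
    return re.sub(r"[./;(),]", " ", value)


print(strip_punctuation("maize/beans (Kenya); livestock."))
```
- Each matched character becomes a single space, so runs of punctuation leave runs of spaces, which is fine when the next step splits on spaces and ignores empty strings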
- Then I checked for existing items on CGSpace matching these MELIAs using my duplicate checker:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
```
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
```console
$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
```
- I had to use `xsv` because `csvcut` was throwing an error detecting the dialect of the input CSVs (?)
- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
- I found thirteen items on CGSpace with dates in format "DD/MM/YYYY" so I fixed those
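- A minimal sketch of how dates like these could be normalized, assuming the target is ISO 8601 (YYYY-MM-DD); the helper name `to_iso_date` is hypothetical and this is not necessarily how I fixed the thirteen items:
```python
from datetime import datetime


def to_iso_date(value):
    """Convert DD/MM/YYYY to YYYY-MM-DD, leaving other values untouched."""
    try:
        return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")
    except ValueError:
        # Not in DD/MM/YYYY format (for example already YYYY or YYYY-MM), so keep as-is
        return value


print(to_iso_date("19/08/2022"))
```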
## 2022-08-20
- Peter sent me back the results of the duplicate checking on the Gender presentations
- There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine
- I had a new idea about matching AGROVOC subjects and countries in OpenRefine
- I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:
```python
with open(r"/tmp/cgspace-countries.txt",'r') as f:
    countries = [name.rstrip().lower() for name in f]

return "||".join([x for x in value.split(' ') if x.lower() in countries])
```
- But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:
```python
import re

with open(r"/tmp/agrovoc-subjects.txt",'r') as f:
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
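- The same idea as a standalone Python sketch that can be tested outside OpenRefine; the function name `match_terms` and the sample terms are made up for illustration, and it uses `re.search` with `re.escape` (my substitution, not what the expression above does) so that terms containing regex metacharacters are matched literally:
```python
import re


def match_terms(value, terms):
    """Return the terms that appear as whole words or phrases in value."""
    text = value.lower()
    matches = []
    for term in terms:
        # \b on both sides keeps "gates" from matching inside "investigates"
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            matches.append(term)
    return "||".join(matches)


# Hypothetical sample terms, already lower-cased like the list read from the file
terms = ["conservation agriculture", "maize", "gates"]
print(match_terms("Conservation agriculture for maize farmers", terms))
```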
- Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all `dcterms.subject` values and am looking them up against AGROVOC's API right now:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
COPY 21685
$ csvcut -c 1 /tmp/2022-08-20-agrovoc.csv | sed 1d > /tmp/all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
$ csvgrep -c 'number of matches' -m 0 -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c 1 | sed 1d > /tmp/agrovoc-subjects.txt
$ wc -l /tmp/agrovoc-subjects.txt
11834 /tmp/agrovoc-subjects.txt
```
- Then I created a new column joining the title and abstract, and ran the Jython expression above against this new file with 11,000 AGROVOC terms
- Then I joined that column with Peter's `dcterms.subject` column and then deduplicated it with this Jython:
```python
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
```
- This is way better, but you end up getting a bunch of countries, regions, and short words like "gates" matching in AGROVOC that are inappropriate (we typically don't tag these in AGROVOC) or incorrect (gates, as in windows or doors, not the funding agency)
- I did a text facet in OpenRefine and removed a bunch of these by eye
- Then I finished adding the `dcterms.relation` and CRP metadata flagged by Peter on the Gender presentations
- I'm waiting for him to send me the PDFs and then I will upload them to DSpace Test
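- For the de-duplication step above, a shorter sketch for standalone Python 3.7+, where `dict` preserves insertion order (as far as I know OpenRefine's Jython is Python 2.7, so the list-based version is still the one to use there; the function name `dedupe` is made up for illustration):
```python
def dedupe(value, sep="||"):
    """Drop repeated values while keeping the first occurrence of each."""
    # dict.fromkeys keeps one key per distinct value, in first-seen order
    return sep.join(dict.fromkeys(value.split(sep)))


print(dedupe("maize||gender||maize||kenya"))
```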
## 2022-08-21
- Start indexing on AReS
- The load on CGSpace was around 5.0 today, and now that I started the harvesting it has been over 10 for an hour, sigh...
- I'm going to try an experiment to block Googlebot, bingbot, and Yandex for a week to see if the load goes down
## 2022-08-22
- I tried to re-generate the SAF bundle for the MARLO Innovations after improving the AGROVOC subjects and the v3 PDFs, but six are missing from the v3 zip that are present in the original zip:
- ProjectInnovationSummary-WLE-P500-I78.pdf
- ProjectInnovationSummary-WLE-P452-I699.pdf
- ProjectInnovationSummary-WLE-P518-I696.pdf
- ProjectInnovationSummary-WLE-P442-I740.pdf
- ProjectInnovationSummary-WLE-P516-I647.pdf
- ProjectInnovationSummary-WLE-P438-I585.pdf
- I downloaded them manually using the URLs in the original CSV
- I also uploaded a new version of the MELIAs to DSpace Test
## 2022-08-23
- Checking the number of items on CGSpace so we can keep an eye on the 100,000 number:
```console
dspace=# SELECT COUNT(uuid) FROM item WHERE in_archive='t';
count
-------
95716
(1 row)
```
- If I check OAI I see more, but perhaps that counts mapped items multiple times
- Peter said the 303 Gender PPTs were good to go, so I updated the collection mappings and IDs in OpenRefine and then uploaded them to CGSpace:
```console
$ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-23-gender-ppts.map
```
- I created a [GitHub issue for OpenRXV compatibility issues with DSpace 7](https://github.com/ilri/OpenRXV/issues/133)
## 2022-08-24
- Start working on the MARLO OICRs
- First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:
```console
$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
```
- After that I imported it into OpenRefine for data cleaning
- To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
- First, create a new column with this GREL:
```console
cells["dc.title"].value + " " + cells["dcterms.abstract"].value
```
- Then use this Jython:
```python
import re

with open(r"/tmp/agrovoc-subjects.txt",'r') as f:
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
- After that I de-duplicated the terms using this Jython:
```python
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
```
- Then I split the multi-values on "||" and used a text facet to remove some countries and other nonsense terms that matched, like "gates" and "al" and "s"
- Then I did the same for countries
- Then I exported the CSV and started searching for duplicates so that I can add them as relations:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
```
- Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that
## 2022-08-25
- I started processing the MARLO Policies in OpenRefine, similar to the Innovations, MELIAs, and OICRs above
- I also re-ran the AGROVOC matching on Innovations because my technique has improved since I ran it a few weeks ago
## 2022-08-29
- Start a harvest on AReS
- Meeting with Peter and Abenet about CGSpace issues
- I mapped the one MARLO OICR duplicate from the CCAFS Reports collection and deleted it from the OICRs CSV
## 2022-08-30
- Manuel from the "Alianza SIDALC" in South America contacted me asking for permission to harvest CGSpace and include our content in their system
- I responded that we would be glad if they harvested us, and that they should use a useful user agent so we can contact them in case of any issues or changes on the server
- I emailed ILRI ICT to ask how Abenet and I can use the CGSpace Support email address in our email applications because we haven't checked that account in years
- I tried to log in on office365.com but it gave an error
- I got access to the account and cleaned up the inbox, unsubscribed from a bunch of Microsoft and Yammer feeds, etc
- Remind Dani, Tariku, and Andrea about the legacy links that we want to update on ILRI's website:
- http://mahider.ilri.org → https://cgspace.cgiar.org
- http://mahider.ilri.org/handle/10568/xxxxx → https://hdl.handle.net/10568/xxxxx
- http://www.ilri.org/ilrinews/index.php/archives/xxxx → https://newsarchive.ilri.org/archives/xxxx
- Join the MARLO OICRs with their relations that I processed a few days ago (minus the second id column and some others):
```console
$ xsv join --left id ~/Downloads/2022-08-24-OICRs.csv id ~/Downloads/oicrs-matches-csv.csv | xsv select '!id[1],Your Title,Their Title,Similarity,Your Date,Their Date,datediff' > /tmp/oicrs-with-relations.csv
```
- Then I cleaned them with csv-metadata-quality to catch some duplicates, add regions, etc and re-imported to OpenRefine
- I flagged a few duplicates for Jose and he'll let me know what to do with them
- I imported the OICRs to DSpace Test:
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=fuuuu@fuuu.com --source /tmp/SimpleArchiveFormat-oicrs --mapfile=./2022-08-30-OICRs.map
```
- Meeting with Marie-Angelique, Abenet, Valentina, Sara, and Margarita about Types
- I am testing the `org.apache.cocoon.uploads.autosave=false` setting for XMLUI so that files posted via multi-part forms get memory mapped instead of written to disk
- Check the MARLO Policies for relations and join them with the main CSV file:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-25-Policies-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuui' -o /tmp/policies-matches.csv
$ xsv join --left id ~/Downloads/2022-08-25-Policies-UTF-8-With-Files.csv id /tmp/policies-matches.csv | xsv select '!id[1],Your Title,Their Title,Similarity,Your Date,Their Date' > /tmp/policies-with-relations.csv
```
<!-- vim: set sw=2 ts=2: -->

574
content/posts/2022-09.md Normal file

@@ -0,0 +1,574 @@
---
title: "September, 2022"
date: 2022-09-01T09:41:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-09-01
- A bit of work on the "Mapping CG Core–CGSpace–MEL–MARLO Types" spreadsheet
- I tested an item submission on DSpace Test with the Cocoon `org.apache.cocoon.uploads.autosave=false` change
- The submission works as expected
- Start debugging some region-related issues with csv-metadata-quality
- I created a new test file `test-geography.csv` with some different scenarios
- I also fixed a few bugs and improved the region-matching logic
<!--more-->
- I filed [an issue for the "South-eastern Asia" case mismatch in country_converter](https://github.com/konstantinstadler/country_converter/issues/115) on GitHub
- Meeting with Moayad to discuss OpenRXV developments
- He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more
## 2022-09-02
- I worked a bit more on exclusion and skipping logic in csv-metadata-quality
- I also pruned and updated all the Python dependencies
- Then I released [version 0.6.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0) now that the excludes and region matching support is working way better
## 2022-09-05
- Started a harvest on AReS last night
- Looking over the Solr statistics from last month I see many user agents that look suspicious:
- Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)
- Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36
- Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
- Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
- Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
- Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)
- curb
- bitdiscovery
- omgili/0.5 +http://omgili.com
- Mozilla/5.0 (compatible)
- Vizzit
- Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
- Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0
- Java/17-ea
- AdobeUxTechC4-Async/3.0.12 (win32)
- ZaloPC-win32-24v473
- Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
- Scoop.it
- Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
- ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0
- WebAPIClient
- Mozilla/5.0 Firefox/26.0
- Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)
- For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (`Mozilla / 5.0`)
- Tons of hosts making requests like this:
```console
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
```
- I got a list of hosts making requests like that so I can purge their hits:
```console
# zcat /var/log/nginx/{access,library-access,oai,rest}.log.[123]*.gz | grep 'String.fromCharCode(' | awk '{print $1}' | sort -u > /tmp/ips.txt
```
- I purged 4,718 hits from those IPs
- I see some new Hetzner ranges that I hadn't blocked yet apparently?
- I got a [list of Hetzner's IPs from IP Quality Score](https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh) then added them to the existing ones in my Ansible playbooks:
```console
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
36
$ sort -u /tmp/hetzner-combined.txt | wc -l
49
```
- I will add this new list to nginx's `bot-networks.conf` so they get throttled on scraping XMLUI and get classified as bots in Solr statistics
- Then I purged hits from the following user agents:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics
Total number of hits from bots: 12220
```
- Then I will add these user agents to the ILRI spider override in DSpace
## 2022-09-06
- I'm testing dspace-statistics-api with our DSpace 7 test server
- After setting up the env and the database the `python -m dspace_statistics_api.indexer` runs without issues
- While playing with Solr I tried to search for statistics from this month using `time:2022-09*` but I get this error: "Can't run prefix queries on numeric fields"
- I guess that the syntax in Solr changed since 4.10...
- This works, but is super annoying: `time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]`
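
To make the explicit range less annoying to type, it could be generated. This is a rough sketch that only builds the query string shown above; the `solr_month_range` helper is hypothetical and is not part of dspace-statistics-api.

```python
import calendar


def solr_month_range(year, month, field="time"):
    """Build a Solr range query covering one calendar month in UTC."""
    # monthrange() returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = "{:04d}-{:02d}-01T00:00:00Z".format(year, month)
    end = "{:04d}-{:02d}-{:02d}T23:59:59Z".format(year, month, last_day)
    return "{}:[{} TO {}]".format(field, start, end)


query = solr_month_range(2022, 9)
print(query)
```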
## 2022-09-07
- I tested the controlled-vocabulary changes on DSpace 6 and they work fine
- Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values
- This is a pain because it means I have to re-do the IDs in each file every time I update them
- If I add `id="0000"` to each, then I can use [this vim expression](https://vim.fandom.com/wiki/Making_a_list_of_numbers#Substitute_with_ascending_numbers) `let i=0001 | g/0000/s//\=i/ | let i=i+1` to replace the numbers with increments starting from 1
- Meeting with Marie-Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week
- We made progress with concrete actions and will continue next week
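
The placeholder-ID renumbering for the controlled vocabularies mentioned earlier in this entry could also be scripted instead of done in vim. A minimal sketch, assuming every node carries the literal placeholder `id="0000"`; the `renumber_ids` helper and the sample XML are hypothetical.

```python
import itertools
import re


def renumber_ids(xml_text, placeholder='id="0000"'):
    """Replace each placeholder id with an ascending number starting from 1."""
    counter = itertools.count(1)
    return re.sub(
        re.escape(placeholder),
        lambda match: 'id="{}"'.format(next(counter)),
        xml_text,
    )


# Hypothetical vocabulary fragment
sample_xml = (
    '<node id="0000" label="a"/>\n'
    '<node id="0000" label="b"/>\n'
    '<node id="0000" label="c"/>'
)
renumbered = renumber_ids(sample_xml)
print(renumbered)
```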
## 2022-09-08
- I had a meeting with Nicky from UNEP to discuss issues they are having with their DSpace
- I told her about the meeting of DSpace community people that we're planning at ILRI in the next few weeks
## 2022-09-09
- Add some value mappings to AReS because I see a lot of incorrect regions and countries
- I also found some values that were blank in CGSpace so I deleted them:
```console
dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
DELETE 70
dspace=# COMMIT;
COMMIT
```
- Start a full Discovery index on CGSpace to catch these changes in Discovery
## 2022-09-11
- Today is Sunday and I see the load on the server is high
- Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it's not from them!
- Looking at the top IPs this morning:
```console
# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
...
165 64.233.172.79
166 87.250.224.34
200 69.162.124.231
202 216.244.66.198
385 207.46.13.149
398 207.46.13.147
421 66.249.64.185
422 157.55.39.81
442 2a01:4f8:1c17:5550::1
451 64.124.8.36
578 137.184.159.211
597 136.243.228.195
1185 66.249.64.183
1201 157.55.39.80
3135 80.248.237.167
4794 54.195.118.125
5486 45.5.186.2
6322 2a01:7e00::f03c:91ff:fe9a:3a37
9556 66.249.64.181
```
- The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least
- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
- That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as 'bot' for XMLUI so most of these requests are HTTP 503
- On another note, I'm curious to explore enabling caching of certain REST API responses
- For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:
```console
# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
5 /rest/handle/10568/110310?expand=all
5 /rest/handle/10568/89980?expand=all
5 /rest/handle/10568/97614?expand=all
6 /rest/handle/10568/107086?expand=all
6 /rest/handle/10568/108503?expand=all
6 /rest/handle/10568/98424?expand=all
```
- I specifically have to not cache things like requests for bitstreams because those are from actual users and we need to keep the real requests so we get the statistics hit
- Will be interesting to check the results above as the day goes on (now 10AM)
- To estimate the potential savings from caching I will check how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday's log):
```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
33733
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5637
```
- In the afternoon I started a harvest on AReS (which should affect the numbers above also)
- I enabled an nginx proxy cache on DSpace Test for this location regex: `location ~ /rest/(handle|items|collections|communities)/.+`
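
The same caching estimate can be sketched in Python. This assumes nginx's default "combined" log format, where the request path is the seventh whitespace-separated field (the same field as awk's `$7`); the `rest_path_counts` helper and the log lines are made up for illustration.

```python
from collections import Counter


def rest_path_counts(log_lines):
    """Count requests per URL, skipping bitstream retrievals (like grep -v retrieve)."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # not a complete access log line
        path = fields[6]  # same field as awk's $7
        if "retrieve" not in path:
            counts[path] += 1
    return counts


# Hypothetical log lines, not real requests
sample_log = [
    '192.0.2.1 - - [11/Sep/2022:10:00:00 +0200] "GET /rest/handle/10568/98424?expand=all HTTP/1.1" 200 512 "-" "harvester-a"',
    '192.0.2.2 - - [11/Sep/2022:10:00:05 +0200] "GET /rest/handle/10568/98424?expand=all HTTP/1.1" 200 512 "-" "harvester-b"',
    '192.0.2.3 - - [11/Sep/2022:10:00:09 +0200] "GET /rest/handle/10568/98424?expand=all HTTP/1.1" 200 512 "-" "harvester-c"',
    '192.0.2.1 - - [11/Sep/2022:10:01:00 +0200] "GET /rest/items/083dbb0d/metadata HTTP/1.1" 200 256 "-" "harvester-a"',
    '192.0.2.4 - - [11/Sep/2022:10:02:00 +0200] "GET /rest/bitstreams/15412/retrieve HTTP/1.1" 200 9000 "-" "browser"',
]

counts = rest_path_counts(sample_log)
unique_urls = len(counts)                                # like: sort -u | wc -l
repeated_urls = sum(1 for n in counts.values() if n > 1)  # like: awk '$1 > 1' | wc -l
histogram = Counter(counts.values())                      # how many URLs were requested n times
print(unique_urls, repeated_urls, dict(histogram))
```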
## 2022-09-12
- I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled
- I had to tune the regular expression in nginx a bit because the REST requests OpenRXV uses weren't matching
- Now I'm trying this one: `/rest/(handle|items|collections|communities)/?`
- Testing in [regex101.com](https://regex101.com/r/vPz11y/1) with this test string:
```
/rest/handle/10568/27611
/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=36270
/rest/handle/10568/110310?expand=all
/rest/rest/bitstreams/28926633-c7c2-49c2-afa8-6d81cadc2316/retrieve
/rest/bitstreams/15412/retrieve
/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/metadata
/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/bitstreams
/rest/collections/edea23c0-0ebd-4525-90b0-0b401f997704/items
/rest/items/14507941-aff2-4d57-90bd-03a0733ad859/metadata
/rest/communities/b38ea726-475f-4247-a961-0d0b76e67f85/collections
/rest/collections/e994c450-6ff7-41c6-98df-51e5c424049e/items?limit=10000
```
- I estimate that it will take about 1GB of cache to harvest 100,000 items from CGSpace with OpenRXV (10,000 pages)
- Basically all of the test strings except lines 4 and 5 (the bitstream retrieve requests) should match
- Upload 682 OICRs from MARLO to CGSpace
- We had tested these on DSpace Test last month along with the MELIAs, Policies, and Innovations, but we decided to upload the OICRs first so that other things can link against them as related items
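
The REST cache location regex from this entry can also be checked offline against the same test strings. This is a sketch using Python's `re` module, on the assumption that it behaves like nginx's PCRE for such a simple, unanchored pattern (nginx also matches locations against the URI without the query string, which does not change the result here).

```python
import re

pattern = re.compile(r"/rest/(handle|items|collections|communities)/?")

test_paths = [
    "/rest/handle/10568/27611",
    "/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=36270",
    "/rest/handle/10568/110310?expand=all",
    "/rest/rest/bitstreams/28926633-c7c2-49c2-afa8-6d81cadc2316/retrieve",
    "/rest/bitstreams/15412/retrieve",
    "/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/metadata",
    "/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/bitstreams",
    "/rest/collections/edea23c0-0ebd-4525-90b0-0b401f997704/items",
    "/rest/items/14507941-aff2-4d57-90bd-03a0733ad859/metadata",
    "/rest/communities/b38ea726-475f-4247-a961-0d0b76e67f85/collections",
    "/rest/collections/e994c450-6ff7-41c6-98df-51e5c424049e/items?limit=10000",
]

# 1-based line numbers of the test strings that the pattern does not match
non_matching = [
    number
    for number, path in enumerate(test_paths, start=1)
    if not pattern.search(path)
]
print(non_matching)
```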
## 2022-09-14
- Meeting with Peter, Abenet, Indira, and Michael about CGSpace rollout plan for the Initiatives
## 2022-09-16
- Meeting with Marie-Angelique, Abenet, Margarita, and Sara about types for CG Core
- We are about halfway through the list of types now, with concrete actions for CG Core and CGSpace
- We will meet next in two weeks to hopefully finalize the list, then we can move on to definitions
## 2022-09-18
- Deploy the `org.apache.cocoon.uploads.autosave=false` change on CGSpace
- Start a harvest on AReS
## 2022-09-19
- Deploy the nginx proxy cache for /rest requests on CGSpace
- I had tested this last week on DSpace Test
- By my counts on CGSpace yesterday (Sunday, a busy day for the REST API), we had 5,654 URLs that were requested more than once; most of those were requested exactly twice, and the numbers tail off quickly at three, four, and five requests:
```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5654
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 2' | wc -l
4710
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 3' | wc -l
814
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 4' | wc -l
86
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 5' | wc -l
39
```
- For now I guess requests that were done two or three times by different clients will be cached and that's a win, and I expect more and more REST API activity soon when initiatives and One CGIAR stuff picks up
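
As a back-of-the-envelope check on those counts, the sketch below works out how many upstream requests the cache could have saved. It assumes every repeat of a URL is served from the cache, which ignores cache expiry and anything else that splits the cache key, so it is an optimistic estimate rather than a measurement.

```python
# Counts from the awk output above: {times requested: number of URLs}
repeat_histogram = {2: 4710, 3: 814, 4: 86, 5: 39}
urls_requested_more_than_once = 5654

# URLs not covered by the histogram must have been requested more than five times
accounted_for = sum(repeat_histogram.values())
more_than_five = urls_requested_more_than_once - accounted_for

# A URL requested n times needs only one upstream request, so n - 1 can be cache hits
min_saved_requests = sum((n - 1) * count for n, count in repeat_histogram.items())

print(accounted_for, more_than_five, min_saved_requests)
```

The URLs requested more than five times would add to the savings, so this figure is a lower bound under the stated assumption.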
## 2022-09-20
- I checked the status of the nginx REST API cache on CGSpace and it was stuck at 7,083 items for hours:
```console
# find /var/cache/nginx/rest_cache/ -type f | wc -l
7083
```
- The proxy cache key zone is currently 1m, which can store ~8,000 keys, so that could be what we're running into
- I increased it to 2m and will keep monitoring it
- CIP webmaster contacted me to say they are having problems harvesting CGSpace from their WordPress
- I am not sure if there are issues due to the REST API caching I enabled...
## 2022-09-21
- Planning the Nairobi DSpace Users meeting with Abenet
- Planning to have a call about MEL submitting to CGSpace on Monday with Mohammed Salem
- I created two collections on DSpace Test: one with a workflow, and one without
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API, so I added it to the admin group of each collection
## 2022-09-22
- Nairobi DSpace users meeting at ILRI
- I found a few users that didn't have ORCID iDs and were missing tags on CGSpace so I tagged them:
```console
dc.contributor.author,cg.creator.identifier
"Alonso, Silvia","Silvia Alonso: 0000-0002-0565-536X"
"Goopy, John P.","John Goopy: 0000-0001-7177-1310"
"Korir, Daniel","Daniel Korir: 0000-0002-1356-8039"
"Leitner, Sonja","Sonja Leitner: 0000-0002-1276-8071"
"Fèvre, Eric M.","Eric M. Fèvre: 0000-0001-8931-4986"
"Galiè, Alessandra","Alessandra Galie: 0000-0001-9868-7733"
"Baltenweck, Isabelle","Isabelle Baltenweck: 0000-0002-4147-5921"
"Robinson, Timothy P.","Timothy Robinson: 0000-0002-4266-963X"
"Lannerstad, Mats","Mats Lannerstad: 0000-0002-5116-3198"
"Graham, Michael","Michael Graham: 0000-0002-1189-8640"
"Merbold, Lutz","Lutz Merbold: 0000-0003-4974-170X"
"Rufino, Mariana C.","Mariana Rufino: 0000-0003-4293-3290"
"Wilkes, Andreas","Andreas Wilkes: 0000-0001-7546-991X"
"van der Weerden, T.","Tony van der Weerden: 0000-0002-6999-2584"
"Vermeulen, S.","Sonja Vermeulen: 0000-0001-6242-9513"
"Vermeulen, Sonja","Sonja Vermeulen: 0000-0001-6242-9513"
"Vermeulen, Sonja J.","Sonja Vermeulen: 0000-0001-6242-9513"
"Hung Nguyen-Viet","Hung Nguyen-Viet: 0000-0003-1549-2733"
"Herrero, Mario T.","Mario Herrero: 0000-0002-7741-5090"
"Thornton, Philip K.","Philip Thornton: 0000-0002-1854-0182"
"Duncan, Alan J.","Alan Duncan: 0000-0002-3954-3067"
"Lukuyu, Ben A.","Ben Lukuyu: 0000-0002-9374-3553"
"Lindahl, Johanna F.","Johanna Lindahl: 0000-0002-1175-0398"
"Okeyo Mwai, Ally","Ally Okeyo Mwai: 0000-0003-2379-7801"
"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
"Omore, Amos O.","Amos Omore: 0000-0001-9213-9891"
"Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
"Staal, Steven J.","Steven Staal: 0000-0002-1244-1773"
"Hanotte, Olivier H.","Olivier Hanotte: 0000-0002-2877-4767"
"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
"Gebremedhin, Berhanu","Berhanu Gebremedhin: 0000-0002-3168-2783"
"Ouma, Emily A.","Emily Ouma: 0000-0002-3123-1376"
"Roesel, Kristina","Kristina Roesel: 0000-0002-2553-1129"
"Bishop, Richard P.","Richard Bishop: 0000-0002-3720-9970"
"Lapar, Ma. Lucila","Ma. Lucila Lapar: 0000-0002-4214-9845"
"Rich, Karl M.","Karl Rich: 0000-0002-5581-9553"
"Hoekstra, Dirk","Dirk Hoekstra: 0000-0002-6111-6627"
"Nene, Vishvanath","Vishvanath Nene: 0000-0001-7066-4169"
"Patel, S.P.","Sonal Henson: 0000-0002-2002-5462"
"Hanson, Jean","Jean Hanson: 0000-0002-3648-2641"
"Marshall, Karen","Karen Marshall: 0000-0003-4197-1455"
"Notenbaert, An Maria Omer","An Maria Omer Notenbaert: 0000-0002-6266-2240"
"Ojango, Julie M.K.","Ojango J.M.K.: 0000-0003-0224-5370"
"Wijk, Mark T. van","Mark van Wijk: 0000-0003-0728-8839"
"Tarawali, Shirley A.","Shirley Tarawali: 0000-0001-9398-8780"
"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
"Butterbach-Bahl, Klaus","Klaus Butterbach-Bahl: 0000-0001-9499-6598"
"Poole, Elizabeth J.","Elizabeth Jane Poole: 0000-0002-8570-794X"
"Mulema, Annet A.","Annet Mulema: 0000-0003-4192-3939"
"Dror, Iddo","Iddo Dror: 0000-0002-0800-7456"
"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
"Baker, Derek","Derek Baker: 0000-0001-6020-6973"
"Ericksen, Polly J.","Polly Ericksen: 0000-0002-5775-7691"
"Jones, Christopher S.","Chris Jones: 0000-0001-9096-9728"
"Mude, Andrew G.","Andrew Mude: 0000-0003-4903-6613"
"Puskur, Ranjitha","Ranjitha Puskur: 0000-0002-9112-3414"
"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
"Gibson, John P.","John Gibson: 0000-0003-0371-2401"
"Flintan, Fiona E.","Fiona Flintan: 0000-0002-9732-097X"
"Mrode, Raphael A.","Raphael Mrode: 0000-0003-1964-5653"
"Mtimet, Nadhem","Nadhem Mtimet: 0000-0003-3125-2828"
"Said, Mohammed Yahya","Mohammed Yahya Said: 0000-0001-8127-6399"
"Pezo, Danilo A.","Danilo Pezo: 0000-0001-5345-5314"
"Haileslassie, Amare","Amare Haileslassie: 0000-0001-5237-9006"
"Wright, Iain A.","Iain Wright: 0000-0002-6216-5308"
"Cadilhon, Joseph J.","Jean-Joseph Cadilhon: 0000-0002-3181-5136"
"Domelevo Entfellner, Jean-Baka","Jean-Baka Domelevo Entfellner: 0000-0002-8282-1325"
"Oyola, Samuel O.","Samuel O. Oyola: 0000-0002-6425-7345"
"Agaba, M.","Morris Agaba: 0000-0001-6777-0382"
"Beebe, Stephen E.","Stephen E Beebe: 0000-0002-3742-9930"
"Ouso, Daniel","Daniel Ouso: 0000-0003-0994-2558"
"Ouso, Daniel O.","Daniel Ouso: 0000-0003-0994-2558"
"Rono, Gilbert K.","Gilbert Kibet-Rono: 0000-0001-8043-5423"
"Kibet, Gilbert","Gilbert Kibet-Rono: 0000-0001-8043-5423"
"Juma, John","John Juma: 0000-0002-1481-5337"
"Juma, J.","John Juma: 0000-0002-1481-5337"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-09-22-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
- This adds nearly 5,500 ORCID tags!
- Some of these authors were not in the controlled vocabulary so I added them
## 2022-09-23
- Tag some more ORCID metadata (amended above)
- Meeting with Peter and Abenet to discuss CGSpace issues
- We found a workable solution to the MEL submission issue: they can submit to a dedicated MEL-only collection with no workflow and we will map or move the items after
- Pascal says that they have made a [pull request for their duplicate checker on DSpace 7](https://github.com/DSpace/DSpace/pull/8415) yayyyy
## 2022-09-24
- Found some more ORCID identifiers to tag so I added them to the list above
- Start a harvest on AReS around 8PM on Saturday night
## 2022-09-25
- The harvest on AReS finished, but the load on the CGSpace server is still high, like always on Sunday mornings
- UptimeRobot says it's down sigh...
- I had an idea to include the HTTP Accept header in the nginx proxy cache key to fix the issue we had with CIP last week
- It seems to work:
```
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60'
...
Content-Type: application/json
X-Cache-Status: MISS
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60'
...
Content-Type: application/json
X-Cache-Status: HIT
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60' Accept:application/xml
...
Content-Type: application/xml
X-Cache-Status: MISS
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60' Accept:application/xml
...
Content-Type: application/xml
X-Cache-Status: HIT
```
- This effectively makes our cache half as effective, but hopefully as more people start harvesting the number of requests handled by it will go up
- I will enable this on CGSpace and email Moises from CIP to check if their harvester is working
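
A toy model of why adding the Accept header to the cache key matters. This is only an illustration of the idea and not how nginx implements its proxy cache; the `TinyProxyCache` class is hypothetical, and it assumes the upstream answers in whatever format the Accept header asks for.

```python
class TinyProxyCache:
    """Toy stand-in for a proxy cache keyed on the URI and, optionally, the Accept header."""

    def __init__(self, vary_on_accept):
        self.vary_on_accept = vary_on_accept
        self.store = {}

    def fetch(self, uri, accept):
        key = (uri, accept) if self.vary_on_accept else (uri,)
        if key in self.store:
            return "HIT", self.store[key]
        content_type = accept  # pretend the upstream honors the Accept header
        self.store[key] = content_type
        return "MISS", content_type


uri = "/rest/items?limit=10&offset=60"
requests = ["application/json", "application/json", "application/xml", "application/xml"]

cache = TinyProxyCache(vary_on_accept=True)
statuses = [cache.fetch(uri, accept)[0] for accept in requests]
print(statuses)

# Without the Accept header in the key, the XML client gets the cached JSON response
naive_cache = TinyProxyCache(vary_on_accept=False)
naive_cache.fetch(uri, "application/json")
wrong_answer = naive_cache.fetch(uri, "application/xml")
print(wrong_answer)
```

The first cache reproduces the MISS, HIT, MISS, HIT sequence from the test above, while the second shows how a client asking for XML could be handed a cached JSON response.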
## 2022-09-26
- Update welcome text on CGSpace after our meeting last week
- I found another dozen or so ORCIDs for top authors on ILRI's community on CGSpace and tagged them (~1,100 more metadata fields)
- Last week we discussed moving `cg.identifier.googleurl` to `cg.identifier.url` since there is no need to treat Google Books URLs specially anymore as far as we know
- I made the changes to the submission form and the XMLUI item displays, then moved all existing metadata in PostgreSQL:
```console
dspace= ☘ UPDATE metadatavalue SET metadata_field_id=219 WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=222;
UPDATE 1137
```
- Then I deleted `cg.identifier.googleurl` from the metadata registry
- Meeting with Salem, Svetlana, Valentina, and Abenet about MEL depositing to CGSpace for the initiatives
- Submitting to a collection without a workflow works as expected, and we can even select another collection (with a workflow) to map the item to from the MEL submission
- The three minor issues we found were:
- MEL still doesn't send the bitstream
- MEL sends metadata with a download URL on mel.cgiar.org
- MEL sends a JPEG that says "no thumbnail" when an item doesn't have a thumbnail
- I still need to send feedback to the group
## 2022-09-27
- Find a few more ORCID identifiers missing for ILRI authors and add them to the controlled vocabulary and tag the authors on CGSpace
- Moises from CIP says the WordPress importer worked fine with the current nginx proxy cache settings so it seems adding the HTTP Accept header to the cache key worked
- Update my DSpace 7 environments to 7.4-SNAPSHOT
- I see they have added thumbnails in some places now
- Oh nice, they also added the "recent submissions" to the home page
- While talking with Salem about the MEL depositing to CGSpace we discovered an issue with HTTP DELETE on `/items/{item id}/bitstreams/{bitstream id}` or `/bitstreams/{bitstream id}`
- DSpace removes the bitstream but keeps the empty `THUMBNAIL` bundle, which breaks the display in XMLUI
- Meeting with Enrico et al about PRMS reporting for the initiatives
## 2022-09-28
- I was reading the source code for DSpace 6's REST API and found that it's [not possible to specify a bundle while POSTing a bitstream](https://github.com/DSpace/DSpace/blob/dspace-6.4/dspace-rest/src/main/java/org/dspace/rest/ItemsResource.java#L427)
- I asked Salem how they do it on MEL and he said they pretend to be a human and do it via XMLUI!
- I added a few new ILRI subjects to the input forms on CGSpace
- Both "bushmeat" and "wildlife conservation" are AGROVOC terms, but "wild meat" is not
- The distinction ILRI would like to start making is:
> Meat comes from any animal, and when at ILRI we specifically make
> reference to it in the context of livestock. However the word bushmeat
> refers to illegal harvesting of meat. wild meat is being used as legal
> harvesting of meat from wildlife and not from livestock.
- I added a few more CGIAR authors' ORCID identifiers to our controlled vocabulary and tagged them on CGSpace (~450 more metadata fields)
- Talking to Salem about ORCID identifiers, we compared lists and they have a bunch that we don't have:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/MEL_ORCID_2022-09-28.csv | \
grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | \
sort | \
uniq > /tmp/2022-09-29-combined-orcids.txt
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1421
$ wc -l /tmp/2022-09-29-combined-orcids.txt
1905 /tmp/2022-09-29-combined-orcids.txt
```
- After combining them I ran them through my `resolve-orcids.py` script:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2022-09-29-combined-orcids.txt -o /tmp/2022-09-29-combined-orcids-names.txt -d
```
- I wrote a script `update-orcids.py` to read a list of names and ORCID identifiers and update existing metadata in the database to the latest name format:
```console
$ ./ilri/update-orcids.py -i ~/src/git/cgspace-submission-guidelines/content/terms/cg-creator-identifier/cg-creator-identifier.txt -db dspace -u dspace -p 'fuuu' -m 247 -d
Connected to database.
Fixed 9 occurences of: ADEBOWALE AD AKANDE: 0000-0002-6521-3272
Fixed 43 occurences of: Alamu Emmanuel Oladeji (PhD, FIFST, MNIFST): 0000-0001-6263-1359
Fixed 3 occurences of: Alessandra Galie: 0000-0001-9868-7733
Fixed 1 occurences of: Amanda De Filippo: 0000-0002-1536-3221
...
```
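
The step that combines the two ORCID lists can be sketched in Python with the same pattern that the `grep -oE` pipeline uses. The `unique_orcids` helper and both input snippets are hypothetical stand-ins for the controlled vocabulary XML and the MEL CSV.

```python
import re

# Same pattern as the grep -oE expression above
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Return the sorted, de-duplicated ORCID identifiers found in the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


# Hypothetical inputs; the identifiers are ones already listed in these notes
vocabulary_xml = (
    '<node id="1" label="Silvia Alonso: 0000-0002-0565-536X"/>\n'
    '<node id="2" label="John Goopy: 0000-0001-7177-1310"/>'
)
mel_csv = "name,orcid\nJohn Goopy,0000-0001-7177-1310\nLutz Merbold,0000-0003-4974-170X\n"

combined = unique_orcids(vocabulary_xml, mel_csv)
print(len(combined), combined)
```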
## 2022-09-29
- I've been checking the size of the nginx proxy cache the last few days and it always seems to hover around 14,000 entries and 385MB:
```console
# find /var/cache/nginx/rest_cache/ -type f | wc -l
14202
# du -sh /var/cache/nginx/rest_cache
384M /var/cache/nginx/rest_cache
```
- Also on that note I'm trying to implement a workaround for a potential caching issue that causes MEL to not be able to update items on DSpace Test
- I *think* we might need to allow requests with a JSESSIONID to bypass the cache, but I have to verify with Salem
- We can do this with an nginx map:
```console
# Check if the JSESSIONID cookie is present and contains a 32-character hex
# value, which would mean that a user is actively attempting to re-use their
# Tomcat session. Then we set the $active_user_session variable and use it
# to bypass the nginx proxy cache in REST requests.
map $cookie_jsessionid $active_user_session {
# an empty value is ignored by proxy_cache_bypass and proxy_no_cache
# see: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_bypass
default '';
'~[A-Z0-9]{32}' 1;
}
```
- Then in the location block where we do the proxy cache:
```console
# Don't cache when user Shift-refreshes (Cache-Control: no-cache) or
# when a client has an active session (see the $cookie_jsessionid map).
proxy_cache_bypass $http_cache_control $active_user_session;
proxy_no_cache $http_cache_control $active_user_session;
```
- I found one client making 10,000 requests using a Windows 98 user agent:
```console
Mozilla/4.0 (compatible; MSIE 5.00; Windows 98)
```
- They all come from one IP address (129.227.149.43) in Hong Kong
- The IP belongs to a hosting provider called Zenlayer
- I will add this IP to the nginx bot networks and purge its hits:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip -p
Purging 33027 hits from 129.227.149.43 in statistics
Total number of bot hits purged: 33027
```
- So it seems we've seen this bot before and the total number is much higher than the 10,000 this month
- I had a call with Salem and we verified that the nginx cache bypass for clients who provide a JSESSIONID fixes their issue with updating items/bitstreams from MEL
- The issue was that they delete all metadata and bitstreams, then add them again to make sure everything is up to date. In that process they also re-request the item with all expands to get the bitstreams, and that response ends up getting cached, so they then try to delete the old bitstream
- I also noticed that someone made a [pull request to enable POSTing bitstreams to a particular bundle](https://github.com/DSpace/DSpace/pull/8343) and it works, so that's awesome!
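
The cache-bypass rule from this entry can be modeled in a few lines to see which requests skip the cache. This is a sketch of the decision logic only: it assumes nginx's documented behavior that `proxy_cache_bypass` triggers when at least one of its values is non-empty and not "0", and that a `~` regex in a `map` is case-sensitive and unanchored. The function names and session IDs are made up.

```python
import re

# Same pattern as the '~[A-Z0-9]{32}' entry in the map
SESSION_RE = re.compile(r"[A-Z0-9]{32}")


def active_user_session(cookie_jsessionid):
    """Mimic the map: "1" when the cookie looks like a Tomcat session ID, otherwise empty."""
    if cookie_jsessionid and SESSION_RE.search(cookie_jsessionid):
        return "1"
    return ""


def bypass_cache(http_cache_control, cookie_jsessionid):
    """Bypass when any value is non-empty and not "0"."""
    values = [http_cache_control or "", active_user_session(cookie_jsessionid)]
    return any(value not in ("", "0") for value in values)


# Hypothetical requests
with_session = bypass_cache("", "0123456789ABCDEF0123456789ABCDEF")
anonymous = bypass_cache("", "")
forced_refresh = bypass_cache("no-cache", None)
lowercase_cookie = bypass_cache("", "0123456789abcdef0123456789abcdef")
print(with_session, anonymous, forced_refresh, lowercase_cookie)
```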
## 2022-09-30
- I applied [the patch for POSTing bitstreams to other bundles](https://github.com/DSpace/DSpace/pull/8343) on CGSpace
- Testing a few other DSpace 6.4 patches on DSpace Test:
- [DS-3791 Make sure the "yearDifference" takes into account that a gap of 10 year contains 11 years](https://github.com/DSpace/DSpace/pull/1901)
- [DS-3873 Limit the usage of PDFBoxThumbnail to PDFs](https://github.com/DSpace/DSpace/pull/2501)
- [Reduce itemCounter init](https://github.com/DSpace/DSpace/pull/2161)
- [ImageMagick: Only execute "identify" on first page](https://github.com/DSpace/DSpace/pull/2201)
- [DS-3881: Show no total results on search-filter](https://github.com/DSpace/DSpace/pull/2371)
- [pass value instead of qualifier to method](https://github.com/DSpace/DSpace/pull/2699)
- [dspace-api: check for null AND empty qualifier in findByElement()](https://github.com/DSpace/DSpace/pull/7993)
- [Avoid exporting mapped Item more than once](https://github.com/DSpace/DSpace/pull/7995)
- [[DS-4574] v. 6 - Upgrade DBCP2 dependency](https://github.com/DSpace/DSpace/pull/3162)
- [bump up pdfbox version on 6.x to match main branch](https://github.com/DSpace/DSpace/pull/2742)
<!-- vim: set sw=2 ts=2: -->

776
content/posts/2022-10.md Normal file

@@ -0,0 +1,776 @@
---
title: "October, 2022"
date: 2022-10-01T19:45:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-10-01
- Start a harvest on AReS last night
- Yesterday I realized how to use [GraphicsMagick with im4java](https://im4java.sourceforge.net/docs/dev-guide.html) and I want to re-visit some of my thumbnail tests
- I'm also interested in libvips support via jVips, though last time I checked it was only for Java 8
- I filed [an issue to ask about Java 11+ support](https://github.com/criteo/JVips/issues/141)
<!--more-->
## 2022-10-03
- Make two pull requests for DSpace 7.x
- [Update PDFBox dependency to version 2.0.27](https://github.com/DSpace/DSpace/pull/8503)
- [Update Apache commons-dbcp2 and commons-pool2 dependencies](https://github.com/DSpace/DSpace/pull/8504)
- Udana asked me why IWMI's RSS feed is not showing the latest publications in his email inbox
- He is using this feed from FeedBurner: https://feeds.feedburner.com/iwmi-cgspace
- I don't have access to the FeedBurner configuration, but I looked at the [raw feed](https://gist.github.com/alanorth/0c518fc571f450f8cc353c42cbdd277c) and see it's just getting all the items in the IWMI community
- This OpenSearch query should do the same: `https://cgspace.cgiar.org/open-search/discover?scope=10568/16814&query=*&sort_by=3&order=DESC`
- The `sort_by=3` corresponds to `webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date` in dspace.cfg
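- A minimal sketch of how that OpenSearch URL is put together, in case the same feed is needed for another community (the helper name `opensearch_url` is my own, not part of DSpace; only the parameters shown in the query above are used):

```python
from urllib.parse import urlencode


def opensearch_url(base, scope, query="*", sort_by=3, order="DESC"):
    """Build a DSpace OpenSearch discover URL for a community/collection handle."""
    params = {
        "scope": scope,      # handle of the community, e.g. 10568/16814
        "query": query,      # "*" matches every item in the scope
        "sort_by": sort_by,  # 3 = dateaccessioned, per webui.itemlist.sort-option.3
        "order": order,      # newest first
    }
    # safe="/*" leaves the handle's slash and the wildcard unescaped
    return f"{base}/open-search/discover?{urlencode(params, safe='/*')}"


url = opensearch_url("https://cgspace.cgiar.org", "10568/16814")
print(url)
```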
- Peter sent me a CSV file a few days ago that he was unable to upload to CGSpace
- The stacktrace from the error he was getting was:
```console
Java stacktrace: java.lang.ClassCastException: org.apache.cocoon.servlet.multipart.PartInMemory cannot be cast to org.dspace.app.xmlui.cocoon.servlet.multipart.DSpacePartOnDisk
at org.dspace.app.xmlui.aspect.administrative.FlowMetadataImportUtils.processUploadCSV(FlowMetadataImportUtils.java:116)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:155)
at org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:243)
at org.mozilla.javascript.Interpreter.interpretLoop(Interpreter.java:3237)
at org.mozilla.javascript.Interpreter.interpret(Interpreter.java:2394)
at org.mozilla.javascript.InterpretedFunction.call(InterpretedFunction.java:162)
at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:393)
at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:2834)
at org.mozilla.javascript.InterpretedFunction.call(InterpretedFunction.java:160)
at org.mozilla.javascript.Context.call(Context.java:538)
at org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1833)
at org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1803)
at org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter.handleContinuation(FOM_JavaScriptInterpreter.java:698)
at org.apache.cocoon.components.treeprocessor.sitemap.CallFunctionNode.invoke(CallFunctionNode.java:94)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.SelectNode.invoke(SelectNode.java:82)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.buildPipeline(ConcreteTreeProcessor.java:186)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.buildPipeline(TreeProcessor.java:260)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:107)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.SelectNode.invoke(SelectNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.buildPipeline(ConcreteTreeProcessor.java:186)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.buildPipeline(TreeProcessor.java:260)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:107)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.buildPipeline(ConcreteTreeProcessor.java:186)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.buildPipeline(TreeProcessor.java:260)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:277)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:411)
at sun.reflect.GeneratedMethodAccessor331.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.process(Unknown Source)
at org.apache.cocoon.components.treeprocessor.sitemap.SerializeNode.invoke(SerializeNode.java:147)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.servlet.RequestProcessor.process(RequestProcessor.java:351)
at org.apache.cocoon.servlet.RequestProcessor.service(RequestProcessor.java:169)
at org.apache.cocoon.sitemap.SitemapServlet.service(SitemapServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:468)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy186.service(Unknown Source)
at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1216)
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1001)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:945)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:647)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:113)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:160)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:492)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:165)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:235)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:451)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1201)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:654)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:750)
```
- So this is a side effect of the `org.apache.cocoon.uploads.autosave=false` change I made a few weeks ago: the upload now stays in memory (`PartInMemory`), but the XMLUI CSV import casts it to `DSpacePartOnDisk`
- Importing the CSV via the command line works fine
## 2022-10-04
- I stumbled across more low-quality thumbnails on CGSpace
- Some have the description "Generated Thumbnail", and others are manually uploaded ".jpg.jpg" ones...
- I want to add some more thumbnail fixer scripts to the cgspace-java-helpers suite:
- If an item has an `IM Thumbnail` and a `Generated Thumbnail` in the `THUMBNAIL` bundle, remove the `Generated Thumbnail`
- If an item has a PDF bitstream and a JPG bitstream with description /thumbnail/ in the ORIGINAL bundle, remove the /thumbnail/ bitstream in the ORIGINAL bundle and try to remove the /thumbnail/.jpg bitstream in the THUMBNAIL bundle
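- A rough sketch of those two rules as plain decision logic, not the actual cgspace-java-helpers code (the `Bitstream` class and `thumbnails_to_remove` function are hypothetical; matching the description case-insensitively is my assumption; the `<name>.jpg` naming of derivatives follows the ".jpg.jpg" files mentioned above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bitstream:
    name: str
    bundle: str
    description: str = ""


def thumbnails_to_remove(bitstreams):
    """Return the bitstreams of one item that the two rules would remove."""
    remove = []
    thumbs = [b for b in bitstreams if b.bundle == "THUMBNAIL"]
    originals = [b for b in bitstreams if b.bundle == "ORIGINAL"]

    # Rule 1: if there is an "IM Thumbnail", drop any "Generated Thumbnail"
    if any(b.description == "IM Thumbnail" for b in thumbs):
        remove += [b for b in thumbs if b.description == "Generated Thumbnail"]

    # Rule 2: a PDF plus a JPG described as a thumbnail in ORIGINAL
    if any(b.name.lower().endswith(".pdf") for b in originals):
        for b in originals:
            is_jpg = b.name.lower().endswith((".jpg", ".jpeg"))
            if is_jpg and "thumbnail" in b.description.lower():
                remove.append(b)
                # its derivative in THUMBNAIL is named <name>.jpg
                remove += [t for t in thumbs
                           if t.name == b.name + ".jpg" and t not in remove]
    return remove


item = [
    Bitstream("report.pdf", "ORIGINAL"),
    Bitstream("cover.jpg", "ORIGINAL", "thumbnail"),
    Bitstream("report.pdf.jpg", "THUMBNAIL", "IM Thumbnail"),
    Bitstream("report.pdf.jpg", "THUMBNAIL", "Generated Thumbnail"),
    Bitstream("cover.jpg.jpg", "THUMBNAIL"),
]
removed = thumbnails_to_remove(item)
print([(b.name, b.bundle) for b in removed])
```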
## 2022-10-05
- I updated the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to include a new `FixLowQualityThumbnails` script to detect the low-quality thumbnails I found above
- Add missing ORCID identifier for an Alliance author
- I've been running the `dspace cleanup -v` script every few weeks or months on CGSpace and assuming it finished successfully because I didn't see an error on stdout/stderr, but today I noticed that the script keeps saying it is deleting the same bitstreams
- I looked in dspace.log and found the error I used to see a lot:
```console
Caused by: org.postgresql.util.PSQLException: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (uuid)=(99b76ee4-15c6-458c-a940-866148bc7dee) is still referenced from table "bundle".
```
- If I mark the primary bitstream as null manually the cleanup script continues until it finds a few more
- I ended up with a long list of UUIDs to fix before the script would complete:
```console
$ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_bitstream_id in ('b76d41c0-0a02-4f53-bfde-a840ccfff903','1981efaa-eadb-46cd-9d7b-12d7a8cff4c4','97a8b1fa-3c12-4122-9c7b-fc2a3eaf570d','99b76ee4-15c6-458c-a940-866148bc7dee','f330fc22-a787-46e2-b8d0-64cc3e166124','592f4a0d-1ed5-4663-be0e-958c0d3e653b','e73b3178-8f29-42bc-bfd1-1a454903343c','e3a5f592-ac23-4934-a2b2-26735fac0c4f','73f4ff6c-6679-44e8-8cbd-9f28a1df6927','11c9a75c-17a6-4966-a4e8-a473010eb34c','155faf93-92c5-4c17-866e-1db50b1f9687','8e073e9e-ab54-4d99-971a-66de073d51e3','76ddd62c-6499-4a8c-beea-3fc8c60200d8','2850fcc9-f450-430a-9317-c42def74e813','8fef3198-2aea-4bd8-aeab-bf5fccb46e42','9e3c3528-e20f-4da3-a0bd-ae9b8515b770')"
```
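- A minimal sketch of how the list of UUIDs could be collected instead of copying them by hand, assuming the PostgreSQL `Detail:` lines from several cleanup runs have been gathered into one string (the cleanup stops at each violation, so one run only reveals one UUID); the helper names `extract_bundle_uuids` and `build_update_sql` are hypothetical and not part of DSpace:

```python
import re

# Matches the "Detail:" line PostgreSQL prints for the foreign key violation.
DETAIL_RE = re.compile(
    r'Key \(uuid\)=\(([0-9a-f-]{36})\) is still referenced from table "bundle"'
)


def extract_bundle_uuids(log_text):
    """Return the unique bitstream UUIDs from the error details, in first-seen order."""
    seen = []
    for match in DETAIL_RE.finditer(log_text):
        uuid = match.group(1)
        if uuid not in seen:
            seen.append(uuid)
    return seen


def build_update_sql(uuids):
    """Build one UPDATE that clears primary_bitstream_id for the given UUIDs."""
    if not uuids:
        raise ValueError("no UUIDs to fix")
    quoted = ",".join(f"'{u}'" for u in uuids)
    return (
        "UPDATE bundle SET primary_bitstream_id=NULL "
        f"WHERE primary_bitstream_id IN ({quoted});"
    )


# Sample line copied from the dspace.log excerpt above.
sample = (
    "Detail: Key (uuid)=(99b76ee4-15c6-458c-a940-866148bc7dee) "
    'is still referenced from table "bundle".'
)
print(build_update_sql(extract_bundle_uuids(sample)))
```

- The generated statement has the same shape as the `psql` command above, so it could be reviewed and then passed to `psql -d dspace -c`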
## 2022-10-06
- I finished running the cleanup script on CGSpace and the before and after on the number of bitstreams is interesting:
```console
$ find /home/cgspace.cgiar.org/assetstore -type f | wc -l
181094
$ find /home/cgspace.cgiar.org/assetstore -type f | wc -l
178329
```
- So that cleaned up ~2,700 bitstreams!
- Interesting, someone on the DSpace Slack mentioned this as being a known issue with discussion, reproducers, and a pull request: https://github.com/DSpace/DSpace/issues/7348
- I am having an issue with the new FixLowQualityThumbnails script on some communities like 10568/117865 and 10568/97114
- For some reason it doesn't descend into the collections
- Also, my old FixJpgJpgThumbnails doesn't either... weird
- I might have to resort to getting a list of collections and doing it that way:
```console
$ psql -h localhost -U postgres -d dspacetest -c 'SELECT ds6_collection2collectionhandle(uuid) FROM collection WHERE uuid in (SELECT uuid FROM collection);' |
sed 1,2d |
tac |
sed 1,3d > /tmp/collections
```
- Strange, I don't think doing it by collections is actually working because it says it's replacing the bitstreams, but it doesn't actually do it
- I don't have time to figure out what's happening, because I see "update_item" in dspace.log when the script says it's doing it, but it doesn't do it
- I might just extract a list of items that have .jpg.jpg thumbnails from the database and run the script through item mode
- There might be a problem with the context commit logic...?
- I exported a list of items that have .jpg.jpg thumbnails on CGSpace:
```console
$ psql -h localhost -p 5432 -U postgres -d dspacetest -c "SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE text_value ~ '.*\.(jpg|jpeg|JPG|JPEG)\.(jpg|jpeg|JPG|JPEG)' AND dspace_object_id IS NOT NULL;" |
sed 1,2d |
tac |
sed 1,3d |
grep -v '␀' |
sort -u |
sed 's/ //' > /tmp/jpgjpg-handles.txt
```
- I restarted DSpace Test because it had high load since yesterday and I don't know why
- Run `check-duplicates.py` on the 1642 MARLO Innovations to try to include matches from the OICRs we uploaded last month
- Then I processed those matches like I did with the OICRs themselves last month, and then cleaned them one last time with csv-metadata-quality, created a SAF bundle, and uploaded them to CGSpace
- BTW this bumps CGSpace over 100,000 items...
- Then I did the same for the 749 MARLO MELIAs and imported them to CGSpace
- Meeting about CG Core types with Abenet, Marie-Angelique, Sara, Margarita, and Valentina
- I made some minor logic changes to the FixJpgJpgThumbnails script in cgspace-java-helpers
- Now it checks to make sure the bitstream description is not empty or null, and also excludes Maps (in addition to Infographics) since those are likely to be JPEG files in the ORIGINAL bundle on purpose
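- For testing the `.jpg.jpg` pattern from the SQL query above outside the database, a small sketch in Python; note two assumptions that differ from the PostgreSQL regex: this one is anchored at the end of the name and is fully case-insensitive, and `is_double_jpg` is a hypothetical helper, not something in cgspace-java-helpers:

```python
import re

# An image extension repeated twice at the end of the name, e.g. "cover.jpg.jpg".
# Assumption: anchored with "$" and case-insensitive, unlike the SQL pattern.
DOUBLE_EXT_RE = re.compile(r"\.(jpg|jpeg)\.(jpg|jpeg)$", re.IGNORECASE)


def is_double_jpg(filename):
    """Return True for names like 'cover.jpg.jpg' or 'cover.JPEG.jpg'."""
    return DOUBLE_EXT_RE.search(filename) is not None


for name in ["report.jpg.jpg", "report.pdf.jpg", "map.JPEG.jpg"]:
    print(name, is_double_jpg(name))
```

- A normal ImageMagick thumbnail name like `report.pdf.jpg` does not match, which is the distinction the fixer script relies on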
## 2022-10-07
- I did the matching and cleaning on the 512 MARLO Policies and uploaded them to CGSpace
- I sent a list of the IDs and Handles for all four groups of MARLO items to Jose so he can do the redirects on their server:
```console
$ wc -l /tmp/*mappings.csv
1643 /tmp/crp-innovation-mappings.csv
750 /tmp/crp-melia-mappings.csv
683 /tmp/crp-oicr-mappings.csv
513 /tmp/crp-policy-mappings.csv
3589 total
```
- I fixed the mysterious issue with my cgspace-java-helpers scripts not working on communities and collections
- It was because the code wasn't committing the context!
- I ran both `FixJpgJpgThumbnails` and `FixLowQualityThumbnails` on a dozen or so large collections on CGSpace and processed about 1,200 low-quality thumbnails
- I did a complete re-sync of CGSpace to DSpace Test
## 2022-10-08
- Start a harvest on AReS
- Experiment with PDF thumbnails in ImageMagick again, I found an [interesting reference on their legacy website](https://legacy.imagemagick.org/Usage/thumbnails/) saying we can use `-unsharp` after `-thumbnail` to make them less blurry
- There are a few examples for unsharp values (starting from a DSpace default of a flattened JPEG from the PDF, then the thumbnail in a second operation):
```console
$ convert '10568-103447.pdf[0]' -flatten 10568-103447-dspace-step1.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 0x.5 10568-103447-dspace-step2-600-unsharp.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 2x0.5+0.7+0 10568-103447-dspace-step2-600-unsharp2.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 0x0.75+0.75+0.008 10568-103447-dspace-step2-600-unsharp3.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 1.5x1+0.7+0.02 10568-103447-dspace-step2-600-unsharp4.pdf.jpg
```
- I merged all the changes from `6_x-dev` to `6_x-prod` after having run them on DSpace Test for the last ten days
## 2022-10-11
- I put together the microsite for improving DSpace PDF thumbnails: https://github.com/alanorth/improved-dspace-thumbnails/
- I need to make the pull request to DSpace
- I also discussed the thumbnails with Dani in Addis
## 2022-10-12
- I submitted a pull request to DSpace 7 for the `-unsharp 0x0.5` change: https://github.com/DSpace/DSpace/pull/8515
- I did some tests on CGSpace and verified that MEL will indeed need admin permissions on every collection that they want to map to
- I had a call with Salem and he asked me about redirecting from some CRP duplicates that exist in both MELSpace and CGSpace
- We decided that the only way is to use an HTTP 301 redirect in the nginx web server, but I said that I'd check with CNRI to see if there was a way to do this within the Handle system
## 2022-10-13
- Disable the REST API cache on CGSpace temporarily to see if that fixes a strange problem we are seeing with listing publications on ilri.org
- Meeting with MEL, MARLO, and CG Core people to continue discussing `dcterms.type`
- I added the new MEL account to all the appropriate authorizations for Initiatives that ICARDA is involved in on CGSpace
- I still have to add the few that WorldFish is involved in
## 2022-10-14
- Abenet finalized adding the MEL user to all initiative collections on CGSpace
- Re-sync CGSpace to DSpace Test to get the new MEL user and authorizations
- I checked ilri.org and I see more publications for 2021 and earlier
- The results are still strange though because I only see a few for each year
## 2022-10-15
- I'm going to turn the REST API cache on CGSpace back on to see if the ilri.org publications thing gets broken again
- Start a harvest on AReS
## 2022-10-16
- The harvest on AReS finished but somehow there are 10,000 fewer items than the previous indexing... hmmm
- I don't see any hits from MELSpace there so I will start another harvest...
- After starting the harvesting the load on the server went up to 20 and UptimeRobot said CGSpace was down for three hours, sigh
- I stopped the harvesting and the load went down immediately
- I am trying to find a pattern with the load on Sundays
- I see this in the AReS backend logs:
```console
[Nest] 1 - 10/16/2022, 6:42:04 PM [HarvesterService] Starting Harvest =>0
[Nest] 1 - 10/16/2022, 6:42:07 PM [HarvesterService] Starting Harvest =>101555
[Nest] 1 - 10/16/2022, 6:42:10 PM [HarvesterService] Starting Harvest =>4936
```
- Which means MELSpace is having some issue
- I'm not sure what was going on on CGSpace yesterday, but the load was indeed very high according to Munin:
![CGSpace CPU load day](/cgspace-notes/2022/10/cpu-day.png)
- The pattern is clear on Sundays if you look at the past month:
![CGSpace CPU load month](/cgspace-notes/2022/10/cpu-month.png)
- I have yet to find an increased nginx request pattern correlating with the increased load, but looking back on the last year it seems something started happening around March, 2022, and also I start seeing CPU steal in July (red coming from the top of the graph):
![CGSpace CPU load year](/cgspace-notes/2022/10/cpu-year.png)
- The amount of CPU steal is very low if I look at it now, around 1 or 2 percent, but what's happening now reminds me of the mysterious load problems I had in 2019-03 that were due to CPU steal
- Salem said there was an issue with the sitemaps on MELSpace so that's why it wasn't working in AReS
- Load on CGSpace is low in the evening so I'll start a new AReS harvest
## 2022-10-18
- Start mapping the Initiative names on CGSpace to the new short names from Enrico's spreadsheet
- Then I will update them for existing CGSpace items:
```console
$ ./ilri/fix-metadata-values.py -i 2022-10-18-update-initiatives.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.initiative -m 258 -t correct -d -n
```
- And later in the controlled vocabulary
- Apply some corrections to a few hundred items on CGSpace for Peter
- Meeting with Abenet, Sara, and Valentina about CG Core types
- We finished going over our list and agreed to send a message to concerned parties in our organizations for feedback by November 4th
- Next week we will continue doing the definitions
- Re-sync CGSpace to DSpace Test to get the latest Initiatives changes
- I also need to re-create the CIAT/Alliance TIP accounts so they can continue testing
- I re-created the tip-submit@cgiar.org and tip-approve@cgiar.org accounts on DSpace Test
- According to my notes:
- A user must be in the collection admin group in order to deposit via the REST API (not in the collection's "Submit" group, which is for normal submission)
- A user must be in the collection's "Accept/Reject/Edit Metadata" step in order to see and approve the item in the DSpace workflow
- I created a new "TIP test" collection under Alliance's community and added the users accordingly
- I think I'll be able to just add these two submit/approve users to the Alliance Admins and Alliance Editors groups once we're ready
## 2022-10-19
- I submitted a [bug report for the two-page portrait layout of some PDF thumbnails](https://bugs.ghostscript.com/show_bug.cgi?id=705994) on Ghostscript's bug tracker
- For reference, the thumbnail for PDFs like in [10568/116598](https://hdl.handle.net/10568/116598) looks like this:
![gs thumbnail](/cgspace-notes/2022/10/gs-10568-116598.pdf.jpg)
- In other news, I see `pdftocairo` from the poppler package produces a similar, though slightly prettier version of the thumbnail of that PDF:
![pdftocairo thumbnail](/cgspace-notes/2022/10/pdftocairo-10568-116598.pdf.jpg)
- I used the command:
```console
$ pdftocairo -jpeg -singlefile -f 1 -l 1 -scale-to-x 640 -scale-to-y -1 10568-116598.pdf thumb
```
- The Ghostscript developers responded in a few minutes (!) and explained that PDFs can contain many different "boxes":
> PDF files can have multiple different 'Box' values; ArtBox, BleedBox, CropBox, MediaBox and TrimBox. The MediaBox is required the other boxes are optional, a given PDF page description must contain the MediaBox and may contain any or all of the others.
>
> By default Ghostscript uses the MediaBox to determine the size of the media. Other PDF consumers may exhibit other behaviours.
>
> The pages in your PDF file contain all of the Boxes. In the majority of cases the Boxes all contain the same values (which makes their inclusion pointless of course). But for page 1 they differ:
>
> /CropBox[594.375 0.0 1190.55 839.176]
> /MediaBox[0.0 0.0 1190.55 841.89]
>
> You can tell Ghostscript to use a different Box value for the media by using one of -dUseArtBox, -dUseBleedBox, -dUseCropBox, -dUseTrimBox. If I specify -dUseCropBox then the file is rendered as you expect.
- I confirm that adding `-define pdf:use-cropbox=true` to the ImageMagick command produces a better thumbnail in this case
- We can check the boxes in a PDF using `pdfinfo` from the poppler package:
```console
$ pdfinfo -box data/10568-116598.pdf
Creator: Adobe InDesign 17.0 (Macintosh)
Producer: Adobe PDF Library 16.0.3
CreationDate: Tue Dec 7 12:44:46 2021 EAT
ModDate: Tue Dec 7 15:37:58 2021 EAT
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 17
Encrypted: no
Page size: 596.175 x 839.176 pts
Page rot: 0
MediaBox: 0.00 0.00 1190.55 841.89
CropBox: 594.38 0.00 1190.55 839.18
BleedBox: 594.38 0.00 1190.55 839.18
TrimBox: 594.38 0.00 1190.55 839.18
ArtBox: 594.38 0.00 1190.55 839.18
File size: 572600 bytes
Optimized: no
PDF version: 1.6
```
- In this case the MediaBox is a strange size, and we should use the CropBox
- I wonder if we can check that from DSpace...
- Apply some corrections from Peter on CGSpace
- Meeting with Leroy, Daniel, Francesca, and Maria from Alliance to review their TIP tool and talk about next steps
- We asked them to do some real submissions (as opposed to "I like coffee" etc) to test the full breadth of the metadata and controlled vocabularies
- Minor work on the CG Core Types spreadsheet to clear up some of the actions and incorporate some of Peter's feedback
- After looking at the request patterns in nginx on CGSpace for the past few weeks I see nothing that would explain the high loads we see several times per week (especially Sundays!)
- So I suspect there is a noisy neighbor, and actually I do see some non-trivial amount of CPU steal in my Munin graphs and `iostat`
- I asked Linode to move the instance elsewhere
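- Coming back to the MediaBox / CropBox question from the `pdfinfo -box` output above, a rough sketch of how the check could be scripted; it assumes the output format shown above (one `MediaBox:` and one `CropBox:` line), and `parse_boxes`, `cropbox_differs`, and the one-point tolerance are my own hypothetical choices:

```python
import re

# One "MediaBox:" or "CropBox:" line followed by four numbers.
BOX_RE = re.compile(
    r"^(MediaBox|CropBox):[ \t]+([\d.]+)[ \t]+([\d.]+)[ \t]+([\d.]+)[ \t]+([\d.]+)[ \t]*$",
    re.MULTILINE,
)


def parse_boxes(pdfinfo_output):
    """Map box name -> (x0, y0, x1, y1) for the MediaBox and CropBox lines."""
    boxes = {}
    for name, *coords in BOX_RE.findall(pdfinfo_output):
        boxes[name] = tuple(float(c) for c in coords)
    return boxes


def cropbox_differs(pdfinfo_output, tolerance=1.0):
    """Return True if any CropBox coordinate is more than `tolerance` points off."""
    boxes = parse_boxes(pdfinfo_output)
    if "MediaBox" not in boxes or "CropBox" not in boxes:
        return False
    pairs = zip(boxes["MediaBox"], boxes["CropBox"])
    return any(abs(m - c) > tolerance for m, c in pairs)


# Lines copied from the pdfinfo output for 10568-116598.pdf above.
sample = """\
Page size: 596.175 x 839.176 pts
MediaBox: 0.00 0.00 1190.55 841.89
CropBox: 594.38 0.00 1190.55 839.18
"""
print(cropbox_differs(sample))
```

- For this PDF the left edge differs by almost 600 points, so the check would flag it as a candidate for the CropBox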
## 2022-10-22
- Start a harvest on AReS
## 2022-10-24
- Peter sent me some corrections for affiliations:
```console
$ cat 2022-10-24-affiliations.csv
cg.contributor.affiliation,correct
Wageningen University and Research Centre,Wageningen University & Research
Wageningen University and Research,Wageningen University & Research
Wageningen University,Wageningen University & Research
$ ./ilri/fix-metadata-values.py -i 2022-10-24-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
```
- Add ORCID identifier for Claudia Arndt on CGSpace and tag her existing items
- Linode responded to my request last week and said they don't think that the culprit here is CPU steal, but that they would move us to another host anyways
- I still need to check the Munin graphs
## 2022-10-25
- Upload some changes to items on CGSpace for Peter
- Start a full Discovery index on CGSpace:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 226m40.463s
user 132m6.511s
sys 3m15.077s
```
## 2022-10-26
- We published the [infographic](https://hdl.handle.net/10568/125167) and [blog post](https://www.ilri.org/news/celebrating-open-access-cgspace) to mark CGSpace's 100,000th item
- I generated a high-quality thumbnail using ImageMagick in order to Tweet it:
```console
$ convert -density 144 10568-125167.pdf\[0\] -thumbnail x1200 /tmp/10568-125167.pdf.png
$ pngquant /tmp/10568-125167.pdf.png
```
- Spent some time looking at the MediaBox / CropBox thing in DSpace's `ImageMagickThumbnailFilter.java`
- We need to make sure to put `-define pdf:use-cropbox=true` before we specify the input file or else it will not have any effect
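- To make the ordering explicit, a sketch that builds the ImageMagick argument list with the `-define` ahead of the input file; this collapses DSpace's two steps into one command purely for illustration, and `thumbnail_command` and its default values are hypothetical, not what `ImageMagickThumbnailFilter.java` does:

```python
import shlex


def thumbnail_command(pdf_path, out_path, use_cropbox=True,
                      density=144, size="600x600", unsharp="0x0.5"):
    """Build an ImageMagick argument list; nothing is executed here."""
    cmd = ["convert"]
    if use_cropbox:
        # Must come before the input file, otherwise it has no effect.
        cmd += ["-define", "pdf:use-cropbox=true"]
    # Density is also a read-time setting, so it goes before the input too.
    cmd += ["-density", str(density)]
    cmd.append(f"{pdf_path}[0]")  # first page only
    cmd += ["-flatten", "-thumbnail", size, "-unsharp", unsharp, out_path]
    return cmd


cmd = thumbnail_command("10568-116598.pdf", "10568-116598.pdf.jpg")
print(shlex.join(cmd))
```

- The list could be handed to `subprocess.run(cmd, check=True)`, which also avoids having to escape the `[0]` for the shell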
## 2022-10-27
- I found out that we can use [pdfcpu to remove the CropBox from a PDF](https://pdfcpu.io/boxes/boxes_remove.html#examples) for testing:
```console
$ pdfcpu box rem -- "crop" in.pdf out.pdf
```
- I filed [an issue on DSpace](https://github.com/DSpace/DSpace/issues/8549) for the ImageMagick `CropBox` problem
- I decided that this is a bug that should be fixed separately from the "improving thumbnail quality" issue
- I made [a pull request](https://github.com/DSpace/DSpace/pull/8550) to fix the `CropBox` issue
- I did more work on my [improved-dspace-thumbnails](https://github.com/alanorth/improved-dspace-thumbnails/) microsite to complement the DSpace thumbnail pull requests
- I am updating it to recommend using the PDF cropbox and "supersampling" with a higher density than 72
- I measured execution time of ImageMagick with `time` and found that the higher-density mode takes about five times longer on average
- I measured the [maximum heap memory of ImageMagick with Valgrind and Massif](https://stackoverflow.com/a/131346):
```console
$ valgrind --tool=massif magick convert ...
```
- Then I checked the results for each set of default DSpace thumbnail runs and "improved" thumbnail runs using `ms_print` (hacky way to get the max heap, I know):
```console
$ for file in memory-dspace/massif.out.49*; do ms_print "$file" | grep -A1 " MB" | tail -n1 | sed 's/\^.*//'; done
15.87
16.06
21.26
15.88
20.01
15.85
20.06
16.04
15.87
15.87
20.02
15.87
15.86
19.92
10.89
$ for file in memory-improved/massif.out.5*; do ms_print "$file" | grep -A1 " MB" | tail -n1 | sed 's/\^.*//'; done
245.3
245.5
298.6
245.3
306.8
245.2
306.9
245.5
245.2
245.3
306.8
245.3
244.9
306.3
165.6
```
- Ouch, this shows that it takes about *fifteen times* more memory to do the "4x" density of 288!
- It seems more reasonable to use a "2x" density of 144:
```console
$ for file in memory-improved-144/*; do ms_print "$file" | grep -A1 " MB" | tail -n1 | sed 's/\^.*//'; done
61.80
62.00
76.76
61.82
77.43
61.77
77.48
61.98
61.76
61.81
77.44
61.81
61.69
77.16
41.84
```
- There's a really cool visualizer called massif-visualizer, but it isn't easy to parse
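- To put numbers on the memory comparison, a quick calculation over the peak heap values above; this assumes the three lists cover the same fifteen PDFs and simply compares the means:

```python
from statistics import mean

# Peak heap in MB, copied from the ms_print results above.
dspace_default = [15.87, 16.06, 21.26, 15.88, 20.01, 15.85, 20.06, 16.04,
                  15.87, 15.87, 20.02, 15.87, 15.86, 19.92, 10.89]
improved_288 = [245.3, 245.5, 298.6, 245.3, 306.8, 245.2, 306.9, 245.5,
                245.2, 245.3, 306.8, 245.3, 244.9, 306.3, 165.6]
improved_144 = [61.80, 62.00, 76.76, 61.82, 77.43, 61.77, 77.48, 61.98,
                61.76, 61.81, 77.44, 61.81, 61.69, 77.16, 41.84]


def ratio(improved, baseline):
    """Ratio of mean peak heap between two sets of runs."""
    return mean(improved) / mean(baseline)


print(f"288 vs default: {ratio(improved_288, dspace_default):.1f}x")
print(f"144 vs default: {ratio(improved_144, dspace_default):.1f}x")
```

- On these numbers the density of 288 needs roughly fifteen times the default memory, while 144 needs a bit under four times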
## 2022-10-28
- I finalized the code for the ImageMagick density change and made a [pull request](https://github.com/DSpace/DSpace/pull/8553) against DSpace 7.x
## 2022-10-29
- Start a harvest on AReS
## 2022-10-31
- Tag version 6.1 of cgspace-java-helpers: https://github.com/ilri/cgspace-java-helpers/releases/tag/v6.1
- I also pushed a more recent `6.1-SNAPSHOT` version to Maven Central via OSSRH
- I should probably push a non-SNAPSHOT release but I don't have time to figure that out in Maven
- Add some new items on CGSpace and update others for Peter
- Email Mishell from CIP about their [old theses](https://cgspace.cgiar.org/handle/10568/125218) which are using Creative Commons licenses
- They said it's OK so I updated all sixteen items in that collection
- Move the "MEL submissions" collection on CGSpace from ICARDA's community to the Initiatives community
- Meeting with Peter and Abenet about ongoing CGSpace action points
- I created the authorizations for Alliance's TIP tool to submit on CGSpace
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-11.md Normal file

@@ -0,0 +1,530 @@
---
title: "November, 2022"
date: 2022-11-01T09:11:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-11-01
- Last night I re-synced DSpace 7 Test from CGSpace
- I also updated all my local `7_x-dev` branches on the latest upstreams
- I spent some time updating the authorizations in Alliance collections
- I want to make sure they use groups instead of individuals where possible!
- I reverted the Cocoon autosave change because it was more of a nuisance that Peter can't upload CSVs from the web interface, and it is a very low severity security issue
<!--more-->
- I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!
- I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test
- Tim merged my [pull request to override the ImageMagick PDF density in DSpace 7](https://github.com/DSpace/DSpace/pull/8553)
- I ported it to DSpace 6.x and submitted a pull request: https://github.com/DSpace/DSpace/pull/8560
## 2022-11-02
- I joined the FAO-CGIAR AGROVOC results sharing meeting
- From June to October, 2022 we suggested 39 new keywords, added 27 to AGROVOC, 4 rejected, and 9 still under discussion
- Doing duplicate check on IFPRI's batch upload and I found one duplicate uploaded by IWMI earlier this year
- I will update the metadata of that item and map it to the correct Initiative collection
## 2022-11-03
- I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
- I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
```console
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
COPY 1268
```
- Then I started a test run on DSpace Test:
```console
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
```
- I'll be curious to check the log after it's all done.
- After a few hours I see:
```console
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log
626
```
- Not bad, because last week I did a more manual selection of collections and deleted ~200
- I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
- I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
- Export a list of items with PDFs linked there:
```console
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
COPY 4621
```
- After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones...
```console
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
2752
```
- I'm not sure how we'll handle the duplicates because many items are book chapters or something where they share a PDF
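- The stripping and de-duplicating step could also be done in Python once the CSV is loaded; a minimal sketch, where `unique_pdf_urls` is a hypothetical helper and the example URLs are made up (only the `#page` fragments and the `jspui` filter come from the notes above):

```python
def unique_pdf_urls(urls):
    """Strip '#page...' fragments, drop JSPUI links, and de-duplicate."""
    cleaned = set()
    for url in urls:
        base = url.split("#", 1)[0].strip()
        if not base or "jspui" in base:
            continue
        cleaned.add(base)
    return sorted(cleaned)


# Made-up URLs for illustration only.
urls = [
    "http://ciat-library.example.org/articulos/book.pdf#page=12",
    "http://ciat-library.example.org/articulos/book.pdf#page=40",
    "http://ciat-library.example.org/jspui/bitstream/1/2/old.pdf",
    "http://ciat-library.example.org/articulos/report.pdf",
]
print(len(unique_pdf_urls(urls)))
```

- Two chapters of the same book collapse into one file here, which is exactly the case where several items share a PDF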
## 2022-11-04
- I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description "Generated Thumbnail":
```console
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
COPY 1147
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
$ wc -l /tmp/old-thumbnail-handles.txt
987 /tmp/old-thumbnail-handles.txt
```
- A bunch of these have `\N` for some reason when I use the `ds6_bitstream2itemhandle` function to get their handles so I had to exclude those...
- I forced the media-filter for these items on CGSpace:
```console
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
```
- Upload some batch records via CSV for Peter
- Update the about page on CGSpace with new text from Peter
- Add a few more ORCID identifiers and names to my growing file `2022-09-22-add-orcids.csv`
- I tagged fifty-four new authors using this list
- I deleted and mapped one duplicate item for Maria Garruccio
- I updated the CG Core website from Bootstrap v4.6 to v5.2
## 2022-11-07
- I did a harvest on AReS last night but it seems that MELSpace's sitemap is broken again because we have 10,000 fewer records
- I filed [an issue](https://github.com/ecrmnn/iso-3166-1/issues/10) on the iso-3166-1 npm package to update the name of Turkey to Türkiye
- I also filed [an issue](https://github.com/flyingcircusio/pycountry/issues/148) and [a pull request](https://github.com/flyingcircusio/pycountry/pull/149) on the pycountry package
- I also filed [an issue](https://github.com/konstantinstadler/country_converter/issues/121) and [a pull request](https://github.com/konstantinstadler/country_converter/pull/122) on the country-converter package
- I also changed one item on CGSpace that had been submitted since the name was changed
- I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
- I also updated it in the DSpace `input-forms.xml`
- I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
- I submitted a [pull request](https://github.com/ecrmnn/iso-3166-1/pull/11) to update this upstream
- Since I was making all these pull requests I also made [one on country-converter for the UN M.49 region "South-eastern Asia"](https://github.com/konstantinstadler/country_converter/pull/123)
- Port the [ImageMagick PDF cropbox fix](https://github.com/DSpace/DSpace/pull/8550) to DSpace 6.x
- I deployed it on CGSpace, ran all updates, and rebooted the host
- I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
```
- But looking at the items it processed, I'm not sure it's working as expected
- I looked at a few dozen
- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
```console
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.bioversityinternational.org
User-Agent: HTTPie/3.2.1
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 275
Content-Type: text/html; charset=iso-8859-1
Date: Mon, 07 Nov 2022 16:35:21 GMT
Keep-Alive: timeout=15, max=100
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
Server: Apache
```
- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500
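- Parsing the two URLs from the HTTPie output above shows what went wrong: the slash between the host and the path is missing, so the path's first segment gets glued onto the hostname. The same-host comparison in `redirect_host_ok` is just my assumption for a quick sanity check, since legitimate redirects can of course change hosts:

```python
from urllib.parse import urlsplit


def redirect_host_ok(request_url, location):
    """Return True if the redirect target stays on the same host as the request."""
    return urlsplit(location).hostname == urlsplit(request_url).hostname


# Both URLs are copied from the request and response headers above.
request_url = (
    "http://www.bioversityinternational.org"
    "/nc/publications/publication/issue/geneflow_2004.html"
)
location = (
    "https://www.bioversityinternational.orgnc"
    "/publications/publication/issue/geneflow_2004.html"
)

print(urlsplit(location).hostname)
print(redirect_host_ok(request_url, location))
```

- The redirect points at a host ending in `.orgnc`, which does not exist, so clients following the 302 will fail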
## 2022-11-08
- Looking at the Solr statistics hits on CGSpace for 2022-11
- I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
- I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
- I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
- I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
- I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
- I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
- I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
- I will purge all these hits and probably add China Unicom's subnet mask to my nginx `bot-network.conf` file to tag them as bots since there are SO many bad and malicious requests coming from there
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 8975 hits from 221.219.100.42 in statistics
Purging 7577 hits from 122.10.101.60 in statistics
Purging 6536 hits from 135.125.21.38 in statistics
Purging 23950 hits from 163.237.216.11 in statistics
Purging 4093 hits from 51.254.154.148 in statistics
Purging 2797 hits from 221.219.103.211 in statistics
Purging 2618 hits from 216.218.223.53 in statistics
Total number of bot hits purged: 56546
```
- Also interesting to see a few new user agents:
- `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)`
- `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)`
- `MEL`
- `Gov employment data scraper ([[your email]])`
- `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)`
- I will purge all these:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics
Total number of bot hits purged: 10632
```
- Work on the CIAT Library items a bit again in OpenRefine
- I flagged items with:
- URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times)
- Same URL used by more than one item ("Duplicates" facet in OpenRefine, these are some corner case I don't want to handle right now)
- URL containing ":8080" to CIAT's old DSpace (this server is no longer live)
- I want to try to handle the simple cases that should cover most of the items first
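- The three flags above could be expressed roughly like this outside of OpenRefine; `flag_items` and the flag names are hypothetical, the example rows are made up, and I'm assuming duplicates are detected on the full URL exactly as stored:

```python
from collections import Counter


def flag_items(items):
    """items: list of (item_id, url) pairs. Returns {item_id: set of flags}."""
    url_counts = Counter(url for _, url in items)
    flags = {}
    for item_id, url in items:
        item_flags = set()
        if "#page" in url:
            item_flags.add("page-fragment")  # links into a shared book PDF
        if url_counts[url] > 1:
            item_flags.add("duplicate-url")  # same URL on more than one item
        if ":8080" in url:
            item_flags.add("old-dspace")     # CIAT's old DSpace, no longer live
        flags[item_id] = item_flags
    return flags


# Made-up rows for illustration only.
items = [
    ("item-1", "http://ciat-library.example.org/book.pdf#page=12"),
    ("item-2", "http://ciat-library.example.org/shared.pdf"),
    ("item-3", "http://ciat-library.example.org/shared.pdf"),
    ("item-4", "http://ciat-library.example.org:8080/jspui/old.pdf"),
    ("item-5", "http://ciat-library.example.org/simple.pdf"),
]
flags = flag_items(items)
simple_cases = [item_id for item_id, f in flags.items() if not f]
print(simple_cases)
```

- Items with an empty flag set are the simple cases to handle first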
## 2022-11-09
- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
- I got the basic functionality working
## 2022-11-12
- Start a harvest on AReS
## 2022-11-15
- Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
- We agreed to continue adding the feedback for each of the proposed actions
- The others will start filling in definitions for the types
- Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
- We need to be careful especially with regards to authors' outputs that will be reported in the PRMS
## 2022-11-16
- Maria asked if we can extend the timeout for XMLUI sessions
- According to [this issue](https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44) it seems to be 30 minutes by default, as a Tomcat default
- I think we could extend this to an hour, as there is no real security risk (we're not a bank) and most users' lock screens would have activated after ten minutes or so anyways
## 2022-11-20
- Start a harvest on AReS
## 2022-11-22
- Check and upload some items to CGSpace for Peter
- I am waiting for some feedback from him about some duplicates and metadata issues for the rest
## 2022-11-23
- Fix some authorization issues for ABC's TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
- Peter sent me feedback about the duplicates and metadata questions from yesterday
- I uploaded the eight items for COHESA and sixty-two for Gender
- I ran the script to tag ORCID identifiers with my `2022-09-22-add-orcids.csv` file and tagged twenty-seven
- Maria asked for help uploading a large PDF to CGSpace
- The PDF is only two pages, but it is 139MB!
- I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):
```console
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
```
- The ebook one looks really good and is only 2.4MB...
- But for reference, this free Adobe tool seems to work: https://www.adobe.com/acrobat/online/compress-pdf.html
## 2022-11-24
- My script finished downloading the CIAT Library PDFs
- I did some more work on my `post-ciat-pdfs.py` script and tested uploading the items to my local DSpace and DSpace Test
- Then I ran the script on CGSpace, uploading ~1,500 PDFs to existing items
## 2022-11-25
- Tony Murray, who is working on IFPRI's CGSpace integration, emailed me to ask some questions about the REST API
- Oh no, I realized there is a logic issue with the PDFbox cropbox code I added a few weeks ago:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
Loading @mire database changes for module MQM
Changes have been processed
IM Thumbnail tropentag2016_marshall.pdf is replacable.
File: tropentag2016_marshall.pdf.jpg
ERROR filtering, skipping bitstream:
Item Handle: 10568/77010
Bundle Name: ORIGINAL
File Size: 1486580
Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
Asset Store: 0
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- Salem gave me a list of CGSpace collections that have double spaces in the names
- Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
- He sent me a list of about ten collection UUIDs so I fixed them
- I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses... I updated about fifty of them
## 2022-11-26
- Sync DSpace Test with CGSpace
- I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
- See: https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44
- I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
- Then, after coming back up, the handle server wouldn't start
- The `handle-server.log` file shows:
```console
Shutting down...
"2022/11/26 02:12:17 CET" 25 Rotating log files
Error: null
(see the error log for details.)
```
- In the `error.log` file I see:
```console
"2022/11/26 02:12:18 CET" 25 Started new run.
java.lang.UnsupportedOperationException
at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
at java.lang.System.runFinalizersOnExit(System.java:1059)
at net.handle.server.Main.initialize(Main.java:124)
at net.handle.server.Main.main(Main.java:75)
Shutting down...
```
- Ah, it seems to be due to an [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1)
- I see the server upgraded to the new JDK version on 2022-11-10:
```console
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
End-Date: 2022-11-10 04:10:45
```
- As highlighted in the dspace-tech mailing list thread above, [this OpenJDK release retired `Runtime.runFinalizersOnExit`](https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html), so it now always throws `UnsupportedOperationException`:
```console
- JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
```
- I downloaded the previous versions of the packages from Launchpad:
```console
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# dpkg -i openjdk-8-j*8u342-b07*.deb
```
- Then the handle-server process starts up fine, so I held these OpenJDK versions for now:
```console
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
openjdk-8-jdk-headless set on hold.
openjdk-8-jre-headless set on hold.
```
- Start a harvest on AReS
## 2022-11-27
- I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
- For PDFs with only one page I was seeing this in the filter-media output:
```console
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
```
- It turns out that [PDDocument's getPage() is zero-based](https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-)
- I also updated PDFBox from 2.0.24 to 2.0.27
- I synced DSpace 7 Test with CGSpace
- I had to follow my notes from 2022-03 to delete the missing Atmire migrations
## 2022-11-28
- Update `ilri/fix-metadata-values.py` to update the `last_modified` date for items when it updates metadata
- This should allow us to use the normal `index-discovery` (without `-b`) and make REST API responses show a correct last modified date
- Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
- I also updated the `add-orcid-identifiers-csv.py` to update the `last_modified` timestamp of the item
- I re-factored my CGSpace Python scripts to use a helper `util.py` module with common functions
- For now it only has the one for updating an item's `last_modified` timestamp but I will gradually add more
- I also ran our list of ORCID identifiers against ORCID's API to see if anyone changed their name format
- Then I ran them on CGSpace with `ilri/update-orcids.py` to fix them
- Normalize the `text_lang` values for CGSpace metadata again:
```console
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang │ count
───────────┼─────────
en_US │ 2912429
│ 108387
en │ 12457
fr │ 2
vi │ 2
es │ 1
␀ │ 0
(7 rows)
Time: 624.651 ms
localhost/dspacetest= ☘ BEGIN;
BEGIN
Time: 0.130 ms
localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '');
UPDATE 120844
Time: 4074.879 ms (00:04.075)
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang │ count
───────────┼─────────
en_US │ 3033273
fr │ 2
vi │ 2
es │ 1
␀ │ 0
(5 rows)
Time: 346.913 ms
localhost/dspacetest= ☘ COMMIT;
```
- Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
- The PRMS team is confused about our regions, which are mostly UN M.49 with some legacy stuff using different ones
- I think we can fix all the stuff for Initiatives from this year very easily, then work on the legacy stuff later
- Also, I noticed that [country_converter was using the wrong UN M.49 region for Myanmar](https://github.com/konstantinstadler/country_converter/issues/124)
- I submitted a [pull request](https://github.com/konstantinstadler/country_converter/pull/125)
- I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions
- To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter
- This fixed regions for about fifty items
- I dumped the UN M.49 regions from the CSV on the UNSD website:
```console
$ csvcut -d";" -c 'Region Name,Sub-region Name,Intermediate Region Name' ~/Downloads/UNSD\ \ Methodology.csv | sed -e 1d -e 's/,/\n/g' | sort -u
Africa
Americas
Asia
Australia and New Zealand
Caribbean
Central America
Central Asia
Channel Islands
Eastern Africa
Eastern Asia
Eastern Europe
Europe
Latin America and the Caribbean
Melanesia
Micronesia
Middle Africa
Northern Africa
Northern America
Northern Europe
Oceania
Polynesia
South America
South-eastern Asia
Southern Africa
Southern Asia
Southern Europe
Sub-Saharan Africa
Western Africa
Western Asia
Western Europe
```
- For now I will combine it with our existing list, which contains a few legacy regions, while we discuss a long-term plan with Peter and Abenet
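- The same extraction can be sketched in Python, which avoids splitting names on commas. This is only a minimal sketch, assuming the export is semicolon-delimited with the three column names used in the `csvcut` command above; the `unique_regions` helper and the sample rows are made up for illustration:

```python
import csv
import io

REGION_COLUMNS = ("Region Name", "Sub-region Name", "Intermediate Region Name")


def unique_regions(handle):
    """Return a sorted list of the distinct, non-empty region names."""
    regions = set()
    for row in csv.DictReader(handle, delimiter=";"):
        for column in REGION_COLUMNS:
            value = (row.get(column) or "").strip()
            if value:
                regions.add(value)
    return sorted(regions)


# Tiny made-up sample in the same layout as the UNSD export
SAMPLE = (
    "Region Name;Sub-region Name;Intermediate Region Name;Country or Area\n"
    "Africa;Sub-Saharan Africa;Eastern Africa;Kenya\n"
    "Africa;Northern Africa;;Egypt\n"
    "Asia;South-eastern Asia;;Myanmar\n"
)
print(unique_regions(io.StringIO(SAMPLE)))
```

- Empty cells (countries without an intermediate region) are skipped, so no blank line ends up in the list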
- Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets`
- It's apparently the only CRP with an Oxford comma...?
- I updated them all on CGSpace
- Also, I ran an `index-discovery` without the `-b`, since my metadata update scripts now update the `last_modified` timestamp as well
- It finished in fifteen minutes, and I see the changes in the Discovery search and facets
## 2022-11-29
- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
- We discussed some of the feedback from Peter
- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
- I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
- These are UN M.49 regions
## 2022-11-30
- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
- It fixed some whitespace issues and added missing regions to about 1,200 items
- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
- First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata values with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
```console
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
FROM metadatavalue a
JOIN (
SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
) b
ON a.dspace_object_id = b.dspace_object_id
AND a.text_value = b.text_value
AND a.metadata_field_id = b.metadata_field_id
ORDER BY a.text_value) TO /tmp/duplicates.txt
```
- (This query excludes metadata for accession and available dates, provenance, format, etc)
- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata for each item were next to each other, used awk to print the second field (`metadata_value_id`) from every _other_ line, and created a SQL script to delete the metadata:
```console
$ sort -t $'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
awk -F'\t' 'NR%2==0 {print $2}' | \
sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
```
- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
- I just ran it again two more times to find the last duplicates, now we have none!
- I also generated another SQL file with commands to update the last modified timestamps of these items:
```console
$ awk -F'\t' '{print $4}' /tmp/duplicates.txt | sort -u | sed "s/^\(.*\)$/UPDATE item SET last_modified=NOW() WHERE uuid='\1';/" > /tmp/update-timestamp.sql
```
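- The every-other-line approach needs several passes when a value is tripled or quadrupled. A minimal sketch of a single-pass alternative, assuming rows in the same column order as the query above; the `duplicate_value_ids` helper and the sample rows are hypothetical, and this is not what I actually ran:

```python
from collections import defaultdict


def duplicate_value_ids(rows):
    """Return metadata_value_ids to delete, keeping the lowest id per group.

    Each row is (text_value, metadata_value_id, metadata_field_id,
    dspace_object_id), the column order used in the COPY query.
    """
    groups = defaultdict(list)
    for text_value, value_id, field_id, object_id in rows:
        groups[(object_id, field_id, text_value)].append(int(value_id))

    to_delete = []
    for value_ids in groups.values():
        # Keep the first (lowest) id and delete all the others
        to_delete.extend(sorted(value_ids)[1:])
    return sorted(to_delete)


# Made-up rows: one tripled value and one doubled value
rows = [
    ("Kenya", "101", "228", "item-a"),
    ("Kenya", "102", "228", "item-a"),
    ("Kenya", "103", "228", "item-a"),
    ("Kenya", "200", "228", "item-b"),
    ("Kenya", "201", "228", "item-b"),
]
for value_id in duplicate_value_ids(rows):
    print(f"DELETE FROM metadatavalue WHERE metadata_value_id={value_id};")
```

- Grouping on the item, field, and value together means the result does not depend on how the input file is sorted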
- Tezira said she was having trouble archiving submissions
- In the afternoon I looked and found a high number of locks:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c | sort -n
60 dspaceCli
176 dspaceApi
1194 dspaceWeb
```
![PostgreSQL database locks](/cgspace-notes/2022/11/postgres_locks_cgspace-day.png)
- The timing looks suspiciously close to when I was running the batch updates on the ILRI community this morning.
- I restarted Tomcat and PostgreSQL and everything was back to normal
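- For reference, the `grep | sort | uniq -c` tally above can be sketched in Python. This is a minimal sketch that assumes the psql output is already captured as text; the `count_locks_by_application` helper and the sample output are made up:

```python
import re
from collections import Counter

APPLICATION_PATTERN = re.compile(r"dspaceWeb|dspaceApi|dspaceCli")


def count_locks_by_application(psql_output):
    """Count occurrences of each DSpace application name in psql output."""
    return Counter(APPLICATION_PATTERN.findall(psql_output))


# Made-up excerpt standing in for the real pg_locks/pg_stat_activity join
sample_output = """
 relation | 16384 | dspaceWeb | idle in transaction
 relation | 16384 | dspaceWeb | idle
 relation | 16384 | dspaceCli | active
"""
for name, count in sorted(count_locks_by_application(sample_output).items()):
    print(f"{count:7d} {name}")
```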
- I found some items on CGSpace in Dinka, Ndogo, and Bari languages, but the `dcterms.language` field was "other"
- That's so unfortunate! These languages are not in ISO 639-1, but they are in ISO 639-3, which uses Alpha 3 and has more space for languages
- I changed them from other to use the three-letter codes, and I will suggest to the CG Core group that we use ISO 639-3 in the future
- Send feedback to Salem about some metadata issues with MEL submissions to CGSpace
<!-- vim: set sw=2 ts=2: -->

378
content/posts/2022-12.md Normal file

@@ -0,0 +1,378 @@
---
title: "December, 2022"
date: 2022-12-01T08:52:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-12-01
- Fix some incorrect regions on CGSpace
- I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
- Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
- Replace "East Asia" with "Eastern Asia" region on CGSpace (UN M.49 region)
<!--more-->
- CGSpace and PRMS information session with Enrico and a bunch of researchers
- I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance
- I started a harvest on AReS since we've updated so much metadata recently
## 2022-12-02
- File some issues related to metadata on the MEL issue tracker
- [Only use "Open Access" or "Limited Access" access rights when publishing items on CGSpace](https://github.com/CodeObia/MEL/issues/11066)
- [Set the description when submitting bitstreams to CGSpace](https://github.com/CodeObia/MEL/issues/11067)
- [Some items have a Creative Commons license, but are Limited Access and bitstreams are locked](https://github.com/CodeObia/MEL/issues/11068)
## 2022-12-03
- I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many are matching:
```console
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
```
- Out of the box they match 26.4%, but there are many institutions with multiple languages in the text value, as well as countries in parentheses so I think it could be higher
- If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:
```console
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
```
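- A rough Python sketch of the same cleanup, assuming names look like `Name / Nombre (Country)`. The `candidate_names` helper and the sample name are hypothetical, and unlike the sed command it splits on every slash rather than only the first:

```python
import re

TRAILING_PARENTHETICAL = re.compile(r"\s*\([^()]*\)\s*$")


def candidate_names(institution):
    """Split a multilingual name on slashes and drop a trailing "(Country)"."""
    candidates = []
    for part in institution.split("/"):
        name = TRAILING_PARENTHETICAL.sub("", part).strip()
        if name:
            candidates.append(name)
    return candidates


# Made-up institution name for illustration
print(candidate_names("Example Institute / Instituto de Ejemplo (Peru)"))
```

- Each candidate can then be looked up separately, so a match on either language counts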
- I checked CGSpace's top 1,000 affiliations too, first exporting from PostgreSQL:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
```
- Then cutting (tab is the default delimiter):
```console
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
```
- So that's a 54% match for our top affiliations
- I realized we should actually check affiliations and sponsors, since those are stored in separate fields
- When I add those the matches go down a bit to 45%
- Oh man, I realized institutions like `Université d'Abomey Calavi` don't match in ROR because they are like this in the JSON:
```console
"name": "Universit\u00e9 d'Abomey-Calavi"
```
- So we likely match a bunch more than 50%...
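- A minimal sketch of a more forgiving comparison, assuming it is acceptable to ignore case, accents, and hyphens when matching names. The `normalize_name` helper is hypothetical and is not how `ror-lookup.py` is known to work. Note that a JSON parser already decodes the `\u00e9` escape, so in this example the remaining difference is the hyphen:

```python
import json
import unicodedata


def normalize_name(name):
    """Casefold, strip accents, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.replace("-", " ").casefold().split())


# json.loads() decodes the \u00e9 escape to a normal accented character
ror_record = json.loads(r'''{"name": "Universit\u00e9 d'Abomey-Calavi"}''')
cgspace_name = "Université d'Abomey Calavi"

# Exact comparison fails because of the hyphen
print(ror_record["name"] == cgspace_name)
# Comparison after normalization succeeds
print(normalize_name(ror_record["name"]) == normalize_name(cgspace_name))
```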
- I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections
## 2022-12-05
- First day of PRMS technical workshop in Rome
- Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn't completed after twenty-four hours so I canceled it
- Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again
- I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2
- It completed successfully...
## 2022-12-07
- I found a bug in my csv-metadata-quality script regarding the regions
- I was accidentally checking `cg.coverage.subregion` due to a sloppy regex
- This means I've added a few thousand UN M.49 regions to the `cg.coverage.subregion` field in the last few days
- I had to extract them from CGSpace and delete them using `delete-metadata-values.py`
- My [DSpace 7.x pull request to tell ImageMagick about the PDF CropBox](https://github.com/DSpace/DSpace/pull/8550) was merged
- Start a harvest on AReS
## 2022-12-08
- While on the plane I decided to fix some ORCID identifiers, as I had seen some poorly formatted ones
- I couldn't remember the XPath syntax so this was a bit hacky:
```console
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE 'label=".*"' | sed -e 's/label="//' -e 's/"$//' > /tmp/orcid-names.txt
$ ./ilri/update-orcids.py -i /tmp/orcid-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
- After that there were still some poorly formatted ones that my script didn't fix, so perhaps these are new ones not in our list
- I dumped them and combined with the existing ones to resolve later:
```console
localhost/dspace= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=247 AND text_value LIKE '%http%') to /tmp/orcid-formatting.txt;
COPY 36
```
- I think there are really just some new ones...
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/orcid-formatting.txt| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2022-12-08-orcids.txt
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1907
$ wc -l /tmp/2022-12-08-orcids.txt
1939 /tmp/2022-12-08-orcids.txt
```
- Then I applied these updates on CGSpace
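- The comparison between the vocabulary and the database dump can be sketched in Python. This is a minimal sketch using the same loose pattern as the grep above; the `extract_orcids` helper and the sample strings are made up for illustration:

```python
import re

# Same loose pattern as the grep; real ORCID iDs are digits with an optional
# final "X", so this may over-match
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(*texts):
    """Return the sorted, de-duplicated identifiers found in all texts."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


# Made-up stand-ins for the controlled vocabulary and the database dump
vocabulary = '<node id="A" label="Example Person: 0000-0002-1825-0097"/>'
database_dump = (
    "uuid-1\thttps://orcid.org/0000-0002-1825-0097\n"
    "uuid-2\thttp://orcid.org/0000-0001-5109-3700\n"
)
known = extract_orcids(vocabulary)
combined = extract_orcids(vocabulary, database_dump)
# Identifiers that appear in the database but not in the vocabulary
print(sorted(set(combined) - set(known)))
```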
- Maria mentioned that she was getting a lot more items in her daily subscription emails
- I had a hunch it was related to me updating the `last_modified` timestamp after updating a bunch of countries, regions, etc in items
- Then today I noticed this option in `dspace.cfg`: `eperson.subscription.onlynew`
- By default DSpace sends notifications for modified items too! I've disabled it now...
- I applied 498 fixes and two deletions to affiliations sent by Peter
- I applied 206 fixes and eighty-one deletions to donors sent by Peter
- I tried to figure out how to authenticate to the DSpace 7 REST API
- First [you need a CSRF token](https://github.com/DSpace/RestContract/blob/main/csrf-tokens.md), before you can even try to authenticate
- Then you can authenticate, but I can't get it to work:
```console
$ curl -v https://dspace7test.ilri.org/server/api
...
dspace-xsrf-token: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4
...
$ curl -v -X POST --data "user=aorth@omg.com&password=myPassword" "https://dspace7test.ilri.org/server/authn/login" -H "X-XSRF-TOKEN: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4"
```
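- One possible reason for the failure, going by my reading of the REST contract linked above: the login endpoint lives under `/api/`, and the token may need to be sent back as a `DSPACE-XSRF-COOKIE` cookie in addition to the `X-XSRF-TOKEN` header. A minimal, unverified sketch that only builds the request without sending it; the `build_login_request` helper and the credentials are hypothetical:

```python
import urllib.parse
import urllib.request


def build_login_request(base_url, csrf_token, user, password):
    """Build (but do not send) a login request for the DSpace 7 REST API."""
    body = urllib.parse.urlencode({"user": user, "password": password}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/authn/login", data=body, method="POST"
    )
    # Assumption: the token goes in a request header *and* in a cookie, based
    # on the REST contract and not verified against this server
    request.add_header("X-XSRF-TOKEN", csrf_token)
    request.add_header("Cookie", f"DSPACE-XSRF-COOKIE={csrf_token}")
    return request


request = build_login_request(
    "https://dspace7test.ilri.org/server",
    "0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4",
    "user@example.com",
    "myPassword",
)
print(request.get_method(), request.full_url)
```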
- Start a harvest on AReS
## 2022-12-09
- I found a way to check the owner of a Handle prefix
- You query the admin Handle for the prefix, ie: https://hdl.handle.net/0.na/10568
## 2022-12-11
- I got LDAP authentication working on DSpace 7
## 2022-12-12
- Submit some issues to MEL GitHub:
- [Links to https://mel.cgiar.org/dspace/limited for Limited Access items on CGSpace](https://github.com/CodeObia/MEL/issues/11081)
- [Items submitted to CGSpace without Initiative](https://github.com/CodeObia/MEL/issues/11083)
- PRMS planning meeting before tomorrow's meeting with researchers and submitters
## 2022-12-13
- I made some minor changes to csv-metadata-quality
- I switched to using the SPDX license data as a JSON directly from SPDX, instead of via the now-deprecated spdx-license-list package on pypi
- I exported the Initiatives collection to tag missing regions
- I submitted an issue to MEL GitHub:
- [Set the description of bitstreams in the THUMBNAIL bundle to "IM Thumbnail" when submitting to CGSpace](https://github.com/CodeObia/MEL/issues/11084)
- Submit a pull request to [fix the Handle link in the Citizen Lab test URLs for Iran](https://github.com/citizenlab/test-lists/pull/1199)
- I had originally submitted this in 2018, but it seems someone updated the URL in 2020... hmmm
- I normalized the `text_lang` values on CGSpace again:
```console
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 3050302
en | 618
| 605
fr | 2
vi | 2
es | 1
| 0
(7 rows)
dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '', NULL);
UPDATE 1223
dspace=# COMMIT;
COMMIT
```
- I wrote an initial version of a script to map CGSpace items to Initiative collections based on their `cg.contributor.initiative` metadata
- I am still considering if I want to add a mode to *un-map* items that are mapped to collections, but do not have the corresponding metadata tag
## 2022-12-14
- Lots of work on PRMS related metadata issues with CGSpace
- We noticed that PRMS uses `cg.identifier.dataurl` for the FAIR score, but not `cg.identifier.url`
- We don't use these consistently for datasets in CGSpace so I decided to move them to the dataurl field, but we will also ask the PRMS team to consider the normal URL field, as there are commonly other external resources related to the knowledge product there
- I updated the `move-metadata-values.py` script to use the latest best practices from my other scripts and some of the helper functions from `util.py`
- Then I exported a list of text values pointing to Dataverse instances from `cg.identifier.url`:
```console
localhost/dspace= ☘ \COPY (SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=219 AND (text_value LIKE '%persistentId%' OR text_value LIKE '%20.500.11766.1/%')) to /tmp/data.txt;
COPY 61
```
- Then I moved them to `cg.identifier.dataurl` on CGSpace:
```console
$ ./ilri/move-metadata-values.py -i /tmp/data.txt -db dspace -u dspace -p 'dom@in34sniper' -f cg.identifier.url -t cg.identifier.dataurl
```
- I still need to add a note to the CGSpace submission form to inform submitters about the correct field for dataset URLs
- I finalized work on my new `fix-initiative-mappings.py` script
- It has two modes:
1. Check item metadata to see which Initiatives are tagged and then map the item if it is not yet mapped to the corresponding Initiative collection
2. Check item collections to see which Initiatives are mapped and then unmap the item if the corresponding Initiative metadata is missing
- The second one is disabled by default until I can get more feedback from Abenet, Michael, and others
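- The two modes boil down to set differences. A minimal sketch of the idea, not the actual script; the `plan_mappings` function, the Initiative names, and the handles are all hypothetical:

```python
def plan_mappings(tagged_initiatives, mapped_collections, initiative_collections):
    """Return (to_map, to_unmap) collection handles for one item.

    initiative_collections maps an Initiative name to its collection handle.
    """
    expected = {
        initiative_collections[name]
        for name in tagged_initiatives
        if name in initiative_collections
    }
    initiative_handles = set(initiative_collections.values())
    # Mode 1: tagged in metadata but not yet mapped to the collection
    to_map = expected - set(mapped_collections)
    # Mode 2: mapped to an Initiative collection without the matching metadata
    to_unmap = (set(mapped_collections) & initiative_handles) - expected
    return sorted(to_map), sorted(to_unmap)


# Hypothetical Initiative names and handles, for illustration only
collections = {"Initiative A": "10568/0001", "Initiative B": "10568/0002"}
print(plan_mappings(["Initiative A"], ["10568/0002", "10568/9999"], collections))
```

- Collections that are not Initiative collections (like `10568/9999` here) are never candidates for unmapping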
- After I applied a handful of collection mappings I started a harvest on AReS
## 2022-12-15
- I did some metadata quality checks on the Initiatives collection, adding some missing regions and removing a few duplicate ones
## 2022-12-18
- Load on the server is a bit high
- Looking at the nginx logs I see someone from the University of Chicago (128.135.98.29) is using RStudio Desktop to query and scrape CGSpace:
```console
# grep -c 'RStudio Desktop' /var/log/nginx/access.log
5570
```
- RStudio is already in the ILRI bot overrides for DSpace so it shouldn't be causing any extra hits, but I'll put an HTTP 403 in the nginx config to tell the user to use the REST API
- Start a harvest on AReS
## 2022-12-21
- I saw that load on CGSpace was over 20.0 for several hours
- I saw there were some stuck locks in PostgreSQL:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
948 dspaceApi
30 dspaceCli
1237 dspaceWeb
```
- Ah, it's likely there is something stuck because I see the load high since yesterday at 6AM, which is 24 hours now:
![CPU load day](/cgspace-notes/2022/12/cpu-day.png)
![PostgreSQL locks week](/cgspace-notes/2022/12/postgres_locks_ALL-week.png)
- I ran all updates and restarted the server
## 2022-12-22
- I exported the Initiatives collection to check the mappings
- My `fix-initiative-mappings.py` script found six items that could be mapped to new collections based on metadata
- I am still not doing automatic _unmappings_ though...
## 2022-12-23
- I exported the Initiatives collection to check the metadata quality
- I fixed a few errors and missing regions using csv-metadata-quality
- Abenet and Bizu noticed some strange characters in affiliations submitted by MEL
- They appear like so in four items currently `Instituto Nacional de Investigaci<63>n y Tecnolog<6F>a Agraria y Alimentaria, Spain`
- I submitted [an issue](https://github.com/CodeObia/MEL/issues/11108) on MEL's GitHub repository
## 2022-12-24
- Export the ILRI community to try to see if there were any items with Initiative metadata that are not mapped to Initiative collections
- I found about twenty...
- Then I did the same for the AICCRA community
## 2022-12-25
- The load on the server is high and I see some seemingly stuck PostgreSQL locks from dspaceCli:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
44 dspaceApi
58 dspaceCli
```
- [Looking into this more](https://jaketrent.com/post/find-kill-locks-postgres/) I see the PIDs for the dspaceCli locks:
```sql
SELECT pl.pid FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceCli'
```
- And the SQL queries themselves:
```console
postgres=# SELECT pid, state, usename, query, query_start
FROM pg_stat_activity
WHERE pid IN (
SELECT pl.pid FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceCli'
);
```
- For these fifty-eight locks there are only six queries running
- Interestingly, they all started at either 04:00 or 05:00 this morning...
- I canceled one using `SELECT pg_cancel_backend(1098749);` and then two of the other PIDs died, perhaps they were dependent?
- Then I canceled the next one and the remaining ones died also
- I exported the entire CGSpace and then ran the `fix-initiative-mappings.py` script, which found 124 items to be mapped
- Getting only the items that have new mappings from the output file is currently tricky: you have to convert the file to Unix line endings, capture the diff against the original, and re-add the column headers
- At least this means the DSpace batch import has to check WAY fewer items
- For the record, I used grep to get only the new lines:
```console
$ grep -xvFf /tmp/orig.csv /tmp/cgspace-mappings.csv > /tmp/2022-12-25-fix-mappings.csv
```
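- A minimal Python sketch of the same filter that keeps the header row, assuming both files are already split into lines without line endings; the `changed_rows` helper and the sample rows are made up:

```python
def changed_rows(original_lines, updated_lines):
    """Return the header plus every updated line not present in the original."""
    original = set(original_lines)
    header, *rows = updated_lines
    return [header] + [row for row in rows if row not in original]


# Made-up rows: only the second item gained a new mapping
original = ["id,collection", "1,10568/1", "2,10568/2"]
updated = ["id,collection", "1,10568/1", "2,10568/2||10568/3"]
print(changed_rows(original, updated))
```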
- Then I imported to CGSpace, and will start an AReS harvest once it's done
- The import process was quick but it triggered a lot of Solr updates and I see locks rising from dspaceCli again
- After five hours the Solr updating from the metadata import wasn't finished, so I cancelled it, and I see that the items were *not* mapped...
- I split the CSV into multiple files, each with ten items, and the first one imported, but the second went on to do Solr updating stuff forever...
- All twelve files worked except the second one, so it must be something with one of those items...
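- Splitting a CSV into batches of ten with the header repeated in each file can be sketched like this. It is a minimal sketch that assumes no quoted field contains a newline; the `split_rows` helper and the sample rows are hypothetical:

```python
def split_rows(lines, chunk_size=10):
    """Split CSV lines into chunks of chunk_size rows, repeating the header."""
    header, *rows = lines
    return [
        [header] + rows[start : start + chunk_size]
        for start in range(0, len(rows), chunk_size)
    ]


# Made-up file with a header and twenty-four rows
lines = ["id,collection"] + [f"{n},10568/{n}" for n in range(1, 25)]
chunks = split_rows(lines)
# Number of data rows in each chunk
print([len(chunk) - 1 for chunk in chunks])
```

- Importing the chunks one at a time narrows a problem down to at most ten items, and halving the failing chunk again would isolate the single bad item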
- Now I started a harvest on AReS
## 2022-12-28
- I got a notice from UptimeRobot that CGSpace was down
- I look at the server and the load is only 3 or 4.x and looking at Munin I don't see any system statistics that are alarming
- PostgreSQL locks look fine, memory and DSpace sessions look fine...
- There was a strangely high number of tuple accesses half an hour ago, and high CPU going up to then
![PostgreSQL tuple access](/cgspace-notes/2022/12/postgres_tuples_cgspace-day.png)
![CPU day](/cgspace-notes/2022/12/cpu-day2.png)
- And I can access the website just fine, so I guess everything is OK
- I exported the Initiatives collection to tag missing regions...
## 2022-12-29
- I exported the Initiatives collection again and I'm wondering why we have so many items with `text_lang` set to NULL and others when I have been periodically resetting them
- It turns out that doing `... text_lang IN ('en', '', NULL)` doesn't match rows where `text_lang` is NULL, because comparing a value to NULL never evaluates to true
- We actually need to do:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
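- The NULL behavior is easy to reproduce. A minimal sketch using an in-memory SQLite database as a stand-in for PostgreSQL (an assumption: both follow the same three-valued logic for `IN` with NULL); the tiny table and the `count_matches` helper are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (id INTEGER PRIMARY KEY, text_lang TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue (text_lang) VALUES (?)",
    [("en_US",), ("en",), ("",), (None,)],
)


def count_matches(condition):
    """Count the rows of the sample table that satisfy a WHERE condition."""
    query = f"SELECT COUNT(*) FROM metadatavalue WHERE {condition}"
    return conn.execute(query).fetchone()[0]


# NULL inside IN (...) never matches: "text_lang = NULL" is unknown, not true
print(count_matches("text_lang IN ('en', '', NULL)"))
# An explicit IS NULL test catches the NULL row as well
print(count_matches("text_lang IS NULL OR text_lang IN ('en', '')"))
```

- The first query counts two rows and the second counts three
- Since `AND` binds more tightly than `OR`, the NULL test and the `IN` list need to be grouped in parentheses when combined with another condition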
- I updated the text lang values on CGSpace and re-exported the community
- I fixed a bunch of invalid licenses in these items
- Then I added mappings for another handful of items
- I tagged ORCID identifiers for another thirty items or so
- At 8PM I got a notice from UptimeRobot again that CGSpace was down
- The load is still only around 2.x or 3.x, but there is a large (and increasing) number of PostgreSQL connections and locks
- They appear to be all from the frontend:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
2892 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
2950 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
3792 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
4460 dspaceWeb
```
- I don't see any other system statistics that look out of order...
- DSpace sessions, network throughput, CPU, etc all seem sane...
- And then all of a sudden, I didn't do anything, but all the locks disappeared and I was able to access the website... WTF
## 2022-12-30
- Start a harvest on AReS
## 2022-12-31
- I found a bunch of items on AReS that have issue dates in 2023 which made me curious
- Looking closer, I think all of these have been tagged incorrectly because they were published online already in 2022
- I sent a mail to Abenet and Bizu to ask, but I know for sure that PRMS will use the first published date, no matter whether that is online or in print
- I also added some ORCID identifiers to our list and generated thumbnails for some journal articles that were Creative Commons
<!-- vim: set sw=2 ts=2: -->

609
content/posts/2023-01.md Normal file

@@ -0,0 +1,609 @@
---
title: "January, 2023"
date: 2023-01-01T08:44:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-01-01
- Apply some more ORCID identifiers to items on CGSpace using my `2022-09-22-add-orcids.csv` file
- I want to update all ORCID names and refresh them in the database
- I see we have some new ones that aren't in our list if I combine with this file:
<!--more-->
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1939
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1973
```
- I will extract and process them with my `resolve-orcids.py` script:
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-01-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2023-01-01-orcids.txt -o /tmp/2023-01-01-orcids-names.txt -d
```
- Then update them in the database:
```console
$ ./ilri/update-orcids.py -i /tmp/2023-01-01-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
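- The extraction step above could also be done in Python; this is only a sketch that reuses the same pattern as the `grep -oE` call (the function name `unique_orcids` is hypothetical and the identifiers in the example are placeholders, not real ORCID iDs):

```python
import re

# Same pattern as the grep above: four groups of four characters
ORCID = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Return the sorted, de-duplicated identifiers found in all the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID.findall(text))
    return sorted(found)


if __name__ == "__main__":
    vocabulary = '<node id="Example Author: 0000-0000-0000-0001" />'
    additions = "0000-0000-0000-0001\n0000-0000-0000-0002\n"
    print("\n".join(unique_orcids(vocabulary, additions)))
```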
- Load on CGSpace is high around 9.x
- I see there is a CIAT bot harvesting via the REST API with IP 45.5.186.2
- Other than that I don't see any particular system stats as alarming
- There has been a marked increase in load in the last few weeks, perhaps due to Initiative activity...
- Perhaps there are some stuck PostgreSQL locks from CLI tools?
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
58 dspaceCli
46 dspaceWeb
```
- The current time on the server is 08:52 and I see the dspaceCli locks were started at 04:00 and 05:00... so I need to check which cron jobs those belong to as I think I noticed this last month too
- I'm going to wait and see if they finish, but by tomorrow I will kill them
## 2023-01-02
- The load on the server is now very low and there are no more locks from dspaceCli
- So there *was* some long-running process that was running and had to finish!
- That finally sheds some light on the "high load on Sunday" problem where I couldn't find any other distinct pattern in the nginx or Tomcat requests
## 2023-01-03
- The load on the server on Sundays, which I have noticed for a long time, seems to be coming from the DSpace checker cron job
- This checks the checksums of all bitstreams to see if they match the ones in the database
- I exported the entire CGSpace metadata to do country/region checks with `csv-metadata-quality`
- I extracted only the items with countries, which was about 48,000, then split the file into parts of 10,000 items, but the upload found 2,000 changes in the first one and took several hours to complete...
- IWMI sent me ORCID identifiers for new scientists, bringing our total to 2,010
## 2023-01-04
- I finally finished applying the region imports (in five batches of 10,000)
- It was about 7,500 missing regions in total...
- Now I will move on to doing the Initiative mappings
- I modified my `fix-initiative-mappings.py` script to only write out the items that have updated mappings
- This makes it way easier to apply fixes to the entire CGSpace because we don't try to import 100,000 items with no changes in mappings
- More dspaceCli locks from 04:00 this morning (current time on server is 07:33) and today is a Wednesday
- The checker cron job runs on `0,3`, which is Sunday and Wednesday, so this is from that...
- Finally at 16:30 I decided to kill the PIDs associated with those locks...
- I am going to disable that cron job for now and watch the server load for a few weeks
- Start a harvest on AReS
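- For reference, the cron day-of-week field counts from Sunday as 0, which is how `0,3` maps to Sunday and Wednesday; a tiny sketch that only handles comma-separated numbers (no ranges, steps, or names), with a hypothetical helper name:

```python
DAY_NAMES = [
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
]


def cron_weekdays(field):
    """Translate a comma-separated cron day-of-week field into day names."""
    # Many cron implementations also accept 7 for Sunday, hence the modulo
    return [DAY_NAMES[int(part) % 7] for part in field.split(",")]


if __name__ == "__main__":
    print(cron_weekdays("0,3"))
```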
## 2023-01-08
- It's Sunday and I see some PostgreSQL locks belonging to dspaceCli that started at 05:00
- That's strange because I disabled the `dspace checker` one last week, so I'm not sure which this is...
- It's currently 2:30PM on the server so these locks have been there for almost twelve hours
- I exported the entire CGSpace to update the Initiative mappings
- Items were mapped to ~58 new Initiative collections
- Then I ran the ORCID import to catch any new ones that might not have been tagged
- Then I started a harvest on AReS
## 2023-01-09
- Fix some invalid Initiative names on CGSpace and then check for missing mappings
- Check for missing regions in the Initiatives collection
- Export a list of author affiliations from the Initiatives community for Peter to check
- This was slightly hacky because I did it from a CSV export of the Initiatives community, then imported to OpenRefine to split multi-value fields, then did some sed nonsense to handle the quoting:
```console
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-01-09-initiatives.csv | \
sed -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' | \
sort -u | \
sed -e 's/^\(.*\)/"\1/' -e 's/\(.*\)$/\1"/' > /tmp/2023-01-09-initiatives-affiliations.csv
```
## 2023-01-10
- Export the CGSpace Initiatives collection to check for missing regions and collection mappings
## 2023-01-11
- I'm trying the DSpace 7 REST API again
- While following the [DSpace 7 REST API authentication docs](https://github.com/DSpace/RestContract/blob/main/authentication.md) I still cannot log in via curl on the command line because I get an `Access is denied. Invalid CSRF token.` message
- Logging in via the HAL Browser works...
- Someone on the DSpace Slack mentioned that the [authentication documentation is out of date](https://github.com/DSpace/RestContract/issues/209) and we need to specify the cookie too
- I tried it and finally got it to work:
```console
$ curl --head https://dspace7test.ilri.org/server/api
...
set-cookie: DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519; Path=/server; Secure; HttpOnly; SameSite=None
dspace-xsrf-token: 42c78c56-613d-464f-89ea-79142fc5b519
$ curl -v -X POST https://dspace7test.ilri.org/server/api/authn/login --data "user=alantest%40cgiar.org&password=dspace" -H "X-XSRF-TOKEN: 42c78c56-613d-464f-89ea-79142fc5b519" -b "DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519"
...
authorization: Bearer eyJh...9-0
$ curl -v "https://dspace7test.ilri.org/server/api/core/items" -H "Authorization: Bearer eyJh...9-0"
```
- I created [a pull request](https://github.com/DSpace/RestContract/pull/213) to fix the docs
- I did quite a lot of cleanup and updates on the IFPRI batch items for the Gender Equality batch upload
- Then I uploaded them to CGSpace
- I added about twenty more ORCID identifiers to my list and tagged them on CGSpace
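- The login step from the curl session earlier in this entry can be sketched with Python's standard library; the key point is that the same CSRF token goes in both the `X-XSRF-TOKEN` header and the `DSPACE-XSRF-COOKIE` cookie (the function name is hypothetical, and this only builds the request without sending it):

```python
import urllib.parse
import urllib.request


def build_login_request(api_base, csrf_token, user, password):
    """Build the POST to /authn/login with the token in both header and cookie."""
    body = urllib.parse.urlencode({"user": user, "password": password}).encode()
    return urllib.request.Request(
        f"{api_base}/authn/login",
        data=body,
        method="POST",
        headers={
            # Token taken from the dspace-xsrf-token header of an earlier response
            "X-XSRF-TOKEN": csrf_token,
            # Sending only the header was not enough; the cookie must match it
            "Cookie": f"DSPACE-XSRF-COOKIE={csrf_token}",
        },
    )


if __name__ == "__main__":
    request = build_login_request(
        "https://example.org/server/api", "token-from-server", "user@example.org", "secret"
    )
    print(request.get_method(), request.full_url)
```

- Passing the result to `urllib.request.urlopen()` would send it, and the bearer token would then be read from the `Authorization` response header as in the curl output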
## 2023-01-12
- I exported the entire CGSpace and did some cleanups on all metadata in OpenRefine
- I was primarily interested in normalizing the DOIs, but I also normalized a bunch of publishing places
- After this import finishes I will export it again to do the Initiative and region mappings
- I ran the `fix-initiative-mappings.py` script and got forty-nine new mappings...
- I added several dozen new ORCID identifiers to my list and tagged ~500 on CGSpace
- Start a harvest on AReS
## 2023-01-13
- Do a bit more cleanup on licenses, issue dates, and publishers
- Then I started importing my large list of 5,000 items changed from yesterday
- Help Karen add abstracts to a bunch of SAPLING items that were missing them on CGSpace
- For now I only did open access journal articles, but I should do the reports and others too
## 2023-01-14
- Export CGSpace and check for missing Initiative mappings
- There were a total of twenty-five
- Then I exported the Initiatives community to check the countries and regions
## 2023-01-15
- Start a harvest on AReS
## 2023-01-16
- Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems
- Batch import another twenty-eight items for IFPRI across several Initiatives
- On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
- I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
- Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
## 2023-01-17
- Batch import another twenty-three items for IFPRI across several Initiatives
- I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
- I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
- Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669
- Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
- I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality
- I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace
- There is a high load on CGSpace pretty regularly
- Looking at Munin it shows there is a marked increase in DSpace sessions the last few weeks:
![DSpace sessions year](/cgspace-notes/2023/01/jmx_dspace_sessions-year.png)
- Is this attributable to all the PRMS harvesting?
- I also see some PostgreSQL locks starting earlier today:
![PostgreSQL locks day](/cgspace-notes/2023/01/postgres_connections_ALL-day.png)
- I'm curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:
```console
# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
# wc -l /tmp/2023-01-17-cgspace-ips.txt
129446 /tmp/2023-01-17-cgspace-ips.txt
```
- I ran the IPs through my `resolve-addresses-geoip2.py` script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):
```console
$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
/tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
sed 1d | sort | uniq > /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
776 /tmp/networks-to-block.txt
```
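- A rough Python equivalent of the `csvgrep | csvcut | sort | uniq` pipeline above, assuming the resolved CSV has `asn` and `network` columns as used there (the function name and the sample rows, which use documentation-only ASNs and prefixes, are made up):

```python
import csv
import io


def networks_for_asns(csv_text, asns):
    """Return the sorted, unique networks whose ASN is in the given collection."""
    wanted = {str(asn) for asn in asns}
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row["network"] for row in reader if row["asn"] in wanted})


if __name__ == "__main__":
    sample = (
        "ip,asn,network\n"
        "192.0.2.10,64500,192.0.2.0/24\n"
        "192.0.2.11,64500,192.0.2.0/24\n"
        "198.51.100.5,64501,198.51.100.0/24\n"
    )
    print(networks_for_asns(sample, [64500]))
```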
- I added the list of networks to nginx's `bot-networks.conf` so they will all be heavily rate limited
- Looking at the Munin stats again I see the load has been extra high since yesterday morning:
![CPU week](/cgspace-notes/2023/01/cpu-week.png)
- But still, it's suspicious that there are so many PostgreSQL locks
- Looking at the Solr stats to check the hits the last month (actually I skipped December because I was so busy)
- I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)
- I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request
- I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request
- I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request
- I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request
- I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamalytics, but their user agent is all lower case and it's a data center ISP so nope
- I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request
- I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request
- I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request
- I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request
- I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request
- I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request
- I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request
- I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request
- I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request
- I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked
- I see 79.110.73.54 is on ALFA TELECOM / Serverel and is using a different, weird user agent with each request
- There are too many to count... so I will purge these and then move on to user agents
- I purged hits from those IPs:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 439185 hits from 31.148.223.10 in statistics
Purging 2151 hits from 18.203.245.60 in statistics
Purging 1990 hits from 3.249.192.212 in statistics
Purging 1975 hits from 34.244.160.145 in statistics
Purging 1969 hits from 52.213.59.101 in statistics
Purging 2540 hits from 91.209.8.29 in statistics
Purging 1624 hits from 54.78.176.127 in statistics
Purging 1236 hits from 54.74.197.53 in statistics
Purging 1327 hits from 54.246.128.111 in statistics
Purging 1108 hits from 52.16.103.133 in statistics
Purging 1045 hits from 63.32.99.252 in statistics
Purging 999 hits from 176.34.141.181 in statistics
Purging 997 hits from 34.243.17.80 in statistics
Purging 985 hits from 34.240.206.16 in statistics
Purging 862 hits from 18.203.81.120 in statistics
Purging 1654 hits from 176.97.210.106 in statistics
Purging 1628 hits from 51.81.193.200 in statistics
Purging 1020 hits from 79.110.73.54 in statistics
Purging 842 hits from 35.153.105.213 in statistics
Purging 1689 hits from 54.164.237.125 in statistics
Total number of bot hits purged: 466826
```
- Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
- `azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0`
- `Gov employment data scraper ([[your email]])`
- `Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)`
- `crownpeak`
- `Mozilla/5.0 (compatible)`
- Also, a ton of them are lower case, which I've never seen before... it might be possible, but looks super fishy to me:
- `mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0`
- `mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0`
- `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36`
- `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
- I purged some of those:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 1658 hits from azure-logic-apps\/1.0 in statistics
Purging 948 hits from Gov employment data scraper in statistics
Purging 786 hits from Microsoft\.Data\.Mashup in statistics
Purging 303 hits from crownpeak in statistics
Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
Total number of bot hits purged: 4027
```
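- The all-lower-case browser user agents noted above could be flagged with a simple heuristic; this is only a sketch with a hypothetical function name, and it deliberately looks only at agents that claim to be Mozilla-compatible browsers:

```python
def looks_lowercased(user_agent):
    """Flag browser-like user agents that contain no upper-case letters at all."""
    # Real browsers send "Mozilla/5.0 (...)" with mixed case
    return user_agent.startswith("mozilla/") and user_agent == user_agent.lower()


if __name__ == "__main__":
    agents = [
        "mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0",
        "Mozilla/5.0 (compatible)",
        "crownpeak",
    ]
    for agent in agents:
        print(looks_lowercased(agent), agent)
```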
- Then I ran all system updates on the server and rebooted it
- Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers
- I need to re-work how I'm doing this whitelisting and blacklisting... it's way too complicated now
- Export entire CGSpace to check Initiative mappings, and add nineteen...
- Start a harvest on AReS
## 2023-01-18
- I'm looking at all the ORCID identifiers in the database, which seem to be way more than I realized:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
COPY 4231
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-18-orcids.txt
$ wc -l /tmp/2023-01-18-orcids.txt
4518 /tmp/2023-01-18-orcids.txt
```
- Then I resolved them from ORCID and updated them in the database:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d
$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
- Then I updated the controlled vocabulary
- CGSpace became unresponsive in the afternoon, with a high number of locks, but surprisingly low CPU usage:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
83 dspaceApi
7829 dspaceWeb
```
- In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7...
- I hope this doesn't cause some issue with in-progress workflows...
- I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python
- I will add python to the list of bad bot user agents in nginx
- While looking into the locks I see some potential Java heap issues
- Indeed, I see two out of memory errors in Tomcat's journal:
```console
tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
```
- Which explains why the locks went down to normal numbers as I was watching... (because Java crashed)
## 2023-01-19
- Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace
- So it seems an IFPRI user got caught up in the blocking I did yesterday
- Their ISP is Comcast...
- I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS714 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS15169 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS32934 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS14618
$ cat AS* | sort | uniq | wc -l
18179
$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
5872 /tmp/networks.txt
```
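- The aggregation step can be approximated with Python's `ipaddress` module; this is a sketch of the general idea (merging adjacent and overlapping prefixes), not a claim about how `mapcidr -a` works internally, and the function name is hypothetical:

```python
import ipaddress


def aggregate(cidrs):
    """Collapse adjacent or overlapping networks into the fewest covering prefixes."""
    v4, v6 = [], []
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr.strip(), strict=False)
        # collapse_addresses() cannot mix address families, so split them first
        (v4 if network.version == 4 else v6).append(network)
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(network) for network in merged]


if __name__ == "__main__":
    # Documentation prefixes, not real networks from the ASN lists
    print(aggregate(["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]))
```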
## 2023-01-20
- A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives)
- I noticed that MEL and CGSpace are using slightly different vocabularies for SDGs so I sent an email to Salem and Sara
## 2023-01-21
- Export the Initiatives community again to perform collection mappings and country/region fixes
## 2023-01-22
- There has been a high load on the server for a few days, currently 8.0... and I've been seeing some PostgreSQL locks stuck all day:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
11 dspaceApi
28 dspaceCli
981 dspaceWeb
```
- Looking at the locks I see they are from this morning at 5:00 AM, which is the `dspace checker-email` script
- Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this too...
- Then I killed the PIDs of the locks
```console
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | less -S
...
$ ps auxw | grep 18986
postgres 1429108 1.9 1.5 3359712 508148 ? Ss 05:00 13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT
```
- Also, I checked the age of the locks and killed anything over 1 day:
```console
$ psql < locks-age.sql | grep days | less -S
```
- Then I ran all updates on the server and restarted it...
- Salem responded to my question about the SDG mismatch between MEL and CGSpace
- We agreed to use a version based on the text of [this site](http://metadata.un.org/sdg/?lang=en)
- Salem is having issues with some REST API submission / updates
- I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test
- Clean and normalize fifty-eight IFPRI records for batch import to CGSpace
- I did a duplicate check and found six, so that's good!
- I exported the entire CGSpace to check for missing Initiative mappings
- Then I exported the Initiatives community to check for missing regions
- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS
## 2023-01-23
- Salem found that you can actually harvest everything in DSpace 7 using the [`discover/browses` endpoint](https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100)
- Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
- I noticed that we still have "North America" as a region, but according to UN M.49 that is the continent, which comprises "Northern America" the region, so I will update our controlled vocabularies and all existing entries
- I imported changes to 1,800 items
- When it finished five hours later I started a harvest on AReS
## 2023-01-24
- Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI
- Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
- I also added "CGIAR Trust Fund" to all items with an Initiative in `cg.contributor.initiative`
## 2023-01-25
- Oh shit, the import last night ran for twelve hours and then died:
```console
Error committing changes to database: could not execute statement
Aborting most recent changes.
```
- I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes
- Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted
- Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
- We looked on AReS and all the items are still there
- I looked in the DSpace log and see around 2,000 messages like this:
```console
2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I filed a ticket with Atmire to ask them
- For now I just did a light Discovery reindex (not the full one) and all the items appeared again
- Submit an issue to MEL GitHub regarding the capitalization of CRPs: https://github.com/CodeObia/MEL/issues/11133
- I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with [our current controlled vocabulary for CRPs](https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt) and he will update it in MEL.
- On that note, Peter and Abenet and I realized that we still have an old field `cg.subject.crp` with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)
- I exported this list of values to lowercase them and move them to `cg.contributor.crp`
- Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon
```console
$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t correct
$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t cg.contributor.crp
```
- After fixing and moving them all, I deleted the `cg.subject.crp` field from the metadata registry
- I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
- I tried that in a transaction and it hung, so I canceled it and rolled back
- I see some PostgreSQL locks attributed to `dspaceApi` that were started at `2023-01-25 13:40:04.529087+01` and haven't changed since then (that's eight hours ago)
- I killed the pid...
- I also saw some locks owned by `dspaceWeb` that were nine and four hours old, so I killed those too...
- Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can't run the update on the text langs...
- Export entire CGSpace to do Initiative mappings again
- Started a harvest on AReS
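- On the `text_lang` update earlier in this entry: in SQL, `AND` binds more tightly than `OR`, so whether the `text_lang` conditions are grouped decides if the item restriction applies to both of them; a toy sketch in SQLite with made-up rows (not the real schema or data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (uuid TEXT);
    CREATE TABLE metadatavalue (id INTEGER, dspace_object_id TEXT, text_lang TEXT);
    INSERT INTO item VALUES ('item-1');
    INSERT INTO metadatavalue VALUES
      (1, 'item-1', NULL),
      (2, 'item-1', 'en'),
      (3, 'collection-1', 'en'),
      (4, 'collection-1', NULL);
    """
)


def ids(where):
    """Return the ids of the rows that satisfy the given WHERE clause."""
    query = f"SELECT id FROM metadatavalue WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(query)]


only_items = "dspace_object_id IN (SELECT uuid FROM item)"

# Parsed as (items AND lang IS NULL) OR lang IN (...): row 3 is not an item but matches
ungrouped = ids(f"{only_items} AND text_lang IS NULL OR text_lang IN ('en', '')")

# With parentheses the item restriction covers both text_lang conditions
grouped = ids(f"{only_items} AND (text_lang IS NULL OR text_lang IN ('en', ''))")

print(ungrouped, grouped)
```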
## 2023-01-26
- Export entire CGSpace to do some metadata cleanup on various fields
- I also added "CGIAR Trust Fund" to all items in the Initiatives community
## 2023-01-27
- Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting *everything* from PostgreSQL:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/2023-01-27-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -h \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-01-27-initiatives-affiliations.csv
```
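- The first sed command drops the CSV header, strips the surrounding quotes, splits multiple values on "||", and deletes empty lines
- The awk command sets the field separator to a regular expression matching the leading whitespace and count that `uniq -c` prepends, so the second "field" is the affiliation name itself, ie: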
```console
...
309 International Center for Agricultural Research in the Dry Areas
412 International Livestock Research Institute
```
- The second sed command adds the CSV header and quotes back
- I did the same for authors and donors and sent them to Peter to make corrections
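- The same split-and-count idea as the pipeline in this entry could be written in Python with the `csv` module, which takes care of the quoting; a minimal sketch with a hypothetical function name and made-up sample rows:

```python
import csv
import io
from collections import Counter


def affiliation_counts(csv_text, column="cg.contributor.affiliation[en_US]"):
    """Split multi-value cells on "||" and count each value, least frequent first."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for value in (row.get(column) or "").split("||"):
            value = value.strip()
            if value:  # skip items without an affiliation
                counts[value] += 1
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    sample = (
        "id,cg.contributor.affiliation[en_US]\n"
        "1,International Livestock Research Institute||Example University\n"
        "2,International Livestock Research Institute\n"
    )
    for name, total in affiliation_counts(sample):
        print(total, name)
```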
## 2023-01-28
- Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API
## 2023-01-29
- Export the entire CGSpace to do Initiatives collection mappings
- I was thinking about a way to use Crossref's API to enrich our data, for example checking registered DOIs for license information, publishers, etc
- Turns out I had already written `crossref-doi-lookup.py` last year, and it works
- I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren't registered on Crossref, which is about 11,800 DOIs
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-01-29-dois.txt
$ wc -l /tmp/2023-01-29-dois.txt
11819 /tmp/2023-01-29-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
$ csvcut -c 'id,cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
> /tmp/cgspace-temp.csv
$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
| csvgrep -c license -r 'creative' \
  | csvcut -c id,license \
  | sed '1s/license/dcterms.license[en_US]/' > /tmp/2023-01-29-new-licenses.csv
```
- The above was done with just 5,000 DOIs because it was taking a long time, but after the last step I imported into OpenRefine to clean up the license URLs
- Then I imported 635 new licenses to CGSpace woooo
- After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
- Peter finished the corrections on affiliations, authors, and donors
- I quickly checked them and applied each on CGSpace
- Start a harvest on AReS
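- The join between the CGSpace export and the Crossref results earlier in this entry can be sketched in Python; this assumes the Crossref CSV has `doi` and `license` columns as in the csvkit commands, and the function names and sample DOIs are hypothetical (DOI case differences are not handled here):

```python
import csv
import io

DOI_PREFIXES = ("https://doi.org/", "https://dx.doi.org/")


def bare_doi(value):
    """Strip the resolver prefix so DOIs match the form in the Crossref results."""
    for prefix in DOI_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def new_licenses(cgspace_csv, crossref_csv):
    """Return id and license for items whose DOI has a Creative Commons license."""
    licenses = {
        row["doi"]: row["license"]
        for row in csv.DictReader(io.StringIO(crossref_csv))
    }
    results = []
    for row in csv.DictReader(io.StringIO(cgspace_csv)):
        license_url = licenses.get(bare_doi(row["cg.identifier.doi[en_US]"]), "")
        if "creative" in license_url:
            results.append({"id": row["id"], "dcterms.license[en_US]": license_url})
    return results


if __name__ == "__main__":
    cgspace = "id,cg.identifier.doi[en_US]\nitem-1,https://doi.org/10.1234/example\n"
    crossref = "doi,license\n10.1234/example,https://creativecommons.org/licenses/by/4.0/\n"
    print(new_licenses(cgspace, crossref))
```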
## 2023-01-30
- Run the thumbnail fixer tasks on the Initiatives collections:
```console
$ chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails 10568/115087 | tee -a /tmp/FixLowQualityThumbnails.log
$ grep -c remove /tmp/FixLowQualityThumbnails.log
16
$ chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixJpgJpgThumbnails 10568/115087 | tee -a /tmp/FixJpgJpgThumbnails.log
$ grep -c replacing /tmp/FixJpgJpgThumbnails.log
13
```
## 2023-01-31
- Someone from the Google Scholar team contacted us to ask why Googlebot is blocked from crawling CGSpace
- I said that I blocked them because they crawl haphazardly and we had high load during PRMS reporting
- Now I will unblock their ASN (AS15169) in nginx...
- I urged them to be smarter about crawling since we're a small team and they are a huge engineering company
- I removed their ASN and regenerated my list from 2023-01-17:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS714 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS32934 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS14618
$ cat AS* | sort | uniq | wc -l
17134
$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
```
- Then I updated nginx...
- Re-run the scripts to delete duplicate metadata values and update item timestamps that I originally used in 2022-11
- This was about 650 duplicate metadata values...
- Exported CGSpace to do some metadata interrogation in OpenRefine
- I looked at items that are set as `Limited Access` but have Creative Commons licenses
- I filtered ~150 that had DOIs and checked them on the Crossref API using `crossref-doi-lookup.py`
- Of those, only about five or so were incorrectly marked as having Creative Commons licenses, so I set those to copyrighted
- For the rest, I set them to Open Access
- Start a harvest on AReS
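For reference, the `mapcidr -a` aggregation step earlier in this entry can be approximated with Python's standard library. This is only a sketch: `aggregate_networks` is a made-up helper name, the sample prefixes are documentation ranges rather than real ASN data, and the output is not guaranteed to match mapcidr's line for line.

```python
import ipaddress


def aggregate_networks(lines):
    """Collapse CIDR strings into the smallest set of covering networks.

    IPv4 and IPv6 are collapsed separately because
    ipaddress.collapse_addresses() refuses to mix address families.
    """
    v4, v6 = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        net = ipaddress.ip_network(line, strict=False)
        (v4 if net.version == 4 else v6).append(net)

    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in collapsed]


# Hypothetical sample input (documentation-only ranges)
sample = [
    "192.0.2.0/25",
    "192.0.2.128/25",
    "198.51.100.0/24",
    "2001:db8::/33",
    "2001:db8:8000::/33",
]
print(aggregate_networks(sample))
```

Adjacent halves are merged into their parent prefix and networks already covered by a larger one are dropped, which is the property that keeps the nginx list short.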
<!-- vim: set sw=2 ts=2: -->

423
content/posts/2023-02.md Normal file

@@ -0,0 +1,423 @@
---
title: "February, 2023"
date: 2023-02-01T10:57:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-02-01
- Export CGSpace to cross check the DOI metadata with Crossref
- I want to try to expand my use of their data to journals, publishers, volumes, issues, etc...
<!--more-->
- First, extract a list of DOIs for use with `crossref-doi-lookup.py`:
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-02-01-cgspace.csv \
| csvgrep -c 1 -m 'doi.org' \
| csvgrep -c 1 -m ' ' -i \
| csvgrep -c 1 -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-02-01-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-02-01-dois.txt -o ~/Downloads/2023-02-01-crossref-results.csv -d
```
- Then extract the ID, DOI, journal, volume, issue, publisher, etc from the CGSpace dump and rename the `cg.identifier.doi[en_US]` to `doi` so we can join on it with the Crossref results file:
```console
$ csvcut -c 'id,cg.identifier.doi[en_US],cg.journal[en_US],cg.volume[en_US],cg.issue[en_US],dcterms.publisher[en_US],cg.number[en_US],dcterms.license[en_US]' ~/Downloads/2023-02-01-cgspace.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed -e '1s/cg.identifier.doi\[en_US\]/doi/' \
-e 's_https://doi.org/__g' \
-e 's_https://dx.doi.org/__g' \
> /tmp/2023-02-01-cgspace-doi-metadata.csv
$ csvjoin -c doi /tmp/2023-02-01-cgspace-doi-metadata.csv ~/Downloads/2023-02-01-crossref-results.csv > /tmp/2023-02-01-cgspace-crossref-check.csv
```
- And import into OpenRefine for analysis and cleaning
- I just noticed that Crossref also has types, so we could use that in the future too!
- I got a few corrections after examining manually, but I didn't manage to identify any patterns that I could use to do any automatic matching or cleaning
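A minimal Python sketch of the two steps in this entry, stripping the resolver prefix and joining on the bare DOI. It is not the csvkit pipeline itself: `normalize_doi` and `join_on_doi` are hypothetical names, handling plain `http://` is an addition the `sed` expressions do not make, and unlike `csvjoin` this keeps only one Crossref row per DOI.

```python
import re

# Resolver prefixes removed by the sed expressions above (plus plain http)
DOI_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", re.IGNORECASE)


def normalize_doi(value):
    """Return the bare DOI without a doi.org or dx.doi.org prefix."""
    return DOI_PREFIX.sub("", value.strip())


def join_on_doi(cgspace_rows, crossref_rows):
    """Inner join two lists of dicts on the bare DOI."""
    crossref_by_doi = {row["doi"]: row for row in crossref_rows}
    joined = []
    for row in cgspace_rows:
        doi = normalize_doi(row["doi"])
        if doi in crossref_by_doi:
            # CGSpace values win if both sides share a column name
            joined.append({**crossref_by_doi[doi], **row, "doi": doi})
    return joined
```

Rows without a match on the Crossref side are dropped, which mirrors the default inner join behaviour of `csvjoin`.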
## 2023-02-05
- Normalize text lang attributes in PostgreSQL, run a quick Discovery index, and then export CGSpace to check Initiative mappings and countries/regions
- Run all system updates on CGSpace (linode18) and reboot it
## 2023-02-06
- Peter said that a new Initiative was approved last month so we need to add it to CGSpace: `Fragility, Conflict, and Migration`
- There is lots of discussion about the "issue date" versus "available date" with Enrico and IFPRI, after lots of feedback from the PRMS QA
- I filed [an issue on CG Core to propose using `dcterms.available` as an optional field to indicate the online date](https://github.com/AgriculturalSemantics/cg-core/issues/43)
## 2023-02-07
- IFPRI's web developer Tony managed to get his Drupal harvester to have a useful user agent:
```console
54.x.x.x - - [06/Feb/2023:10:10:32 +0100] "POST /rest/items/find-by-metadata-field?limit=%22100&offset=0 HTTP/1.1" 200 58855 "-" "IFPRI drupal POST harvester"
```
- He also noticed that there is no pagination on POST requests to `/rest/items/find-by-metadata-field`, and that he needs to increase his timeout for requests that return 100+ results, ie:
```console
$ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
```
- I need to ask on the DSpace Slack about this POST pagination
- Abenet and Udana noticed that the Handle server was not running
- Looking in the `error.log` file I see that the service is complaining about a lock file being present
- This is because Linode had to do emergency maintenance on the VM host this morning and the Handle server didn't shut down properly
- I'm having an issue with `poetry update` so I spent some time debugging and filed [an issue](https://github.com/python-poetry/poetry/issues/7482)
- Proof and import nine items for the Digital Innovation Initiative for IFPRI
- There were only some minor issues in the metadata
- I also did a duplicate check with `check-duplicates.py` just in case
- I did some minor updates on csv-metadata-quality
- First, to reduce warnings on non-SPDX licenses like "Copyrighted; all rights reserved" and "Other" since they are very common for us and I'm sick of seeing the warnings
- Second, to skip whitespace and newline fixes on the abstract field since so many times they are intended
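As a rough illustration of the `find-by-metadata-field` request earlier in this entry, here is the same POST built with Python's standard library, with an explicit user agent and a longer timeout. The endpoint and payload are copied from the curl example; the function names, the user agent string, and the 120 second timeout are placeholders, not what IFPRI's harvester actually uses.

```python
import json
import urllib.request

# Endpoint copied from the curl example in this entry
URL = "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field"


def build_request(key, value, language="en_US", user_agent="example harvester"):
    """Build (but do not send) the JSON POST request."""
    payload = json.dumps({"key": key, "value": value, "language": language})
    return urllib.request.Request(
        URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json", "User-Agent": user_agent},
        method="POST",
    )


def fetch_items(request, timeout=120):
    """Send the request; the timeout is in seconds and arbitrarily generous."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_request("cg.subject.actionArea", "Systems Transformation")
print(request.get_method(), request.full_url)
```

Nothing is sent until `fetch_items(request)` is called, so the timeout can be tuned per query.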
## 2023-02-08
- Make some edits to IFPRI records requested by Jawoo and Leigh
- Help Alessandra upload a last minute report for SAPLING
- Proof and upload twenty-seven IFPRI records to CGSpace
- It's a good thing I did a duplicate check because I found three duplicates!
- Export CGSpace to update Initiative mappings and country/region mappings
- Then start a harvest on AReS
## 2023-02-09
- Do some minor work on the CSS on the DSpace 7 test
## 2023-02-10
- I noticed a large number of PostgreSQL locks from dspaceWeb on CGSpace:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
2033 dspaceWeb
```
- Looking at the lock age, I see some already 1 day old, including this curious query:
```console
select nextval ('public.registrationdata_seq')
```
- I killed all locks that were more than a few hours old
- Export CGSpace to update Initiative collection mappings
- Discuss adding `dcterms.available` to the submission form
- I also looked in the `dcterms.description` field on CGSpace and found ~1,500 items where there is an indication of an online published date
- Using some facets in OpenRefine I narrowed down the ones mentioning "online" and then extracted the dates to a new column:
```console
cells['dcterms.description[en_US]'].value.replace(/.*?(\d+{2}) ([a-zA-Z]+) (\d+{2}).*/,"$3-$2-$1")
```
- Then to handle formats like "2022-April-26" and "2021-Nov-11" I used some replacement GRELs (note the order so we don't replace short patterns in longer strings prematurely):
```console
value.replace("January","01").replace("February","02").replace("March","03").replace("April","04").replace("May","05").replace("June","06").replace("July","07").replace("August","08").replace("September","09").replace("October","10").replace("November","11").replace("December","12")
value.replace("Jan","01").replace("Feb","02").replace("Mar","03").replace("Apr","04").replace("May","05").replace("Jun","06").replace("Jul","07").replace("Aug","08").replace("Sep","09").replace("Oct","10").replace("Nov","11").replace("Dec","12")
```
- This covered about 1,300 items, then I did about 100 more messier ones with some more regex wrangling
- I removed the `dcterms.description[en_US]` field from items where I updated the dates
- Then I added `dcterms.available` to the submission form and the item view
- We need to announce this to the editors
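The extract-then-replace approach in this entry can also be sketched in plain Python, outside OpenRefine, by letting `strptime` handle both full and abbreviated month names. This is an illustration rather than the GREL that was run: `extract_online_date` is a hypothetical helper and the month parsing assumes an English (C) locale.

```python
import re
from datetime import datetime

# Day, month name (full or abbreviated), four-digit year, e.g. "26 April 2022"
DATE_PATTERN = re.compile(r"\b(\d{1,2}) ([A-Za-z]+) (\d{4})\b")


def extract_online_date(description):
    """Return the first "day month year" date as YYYY-MM-DD, or None."""
    match = DATE_PATTERN.search(description)
    if match is None:
        return None
    day, month, year = match.groups()
    # %B parses full month names and %b abbreviated ones
    for fmt in ("%d %B %Y", "%d %b %Y"):
        try:
            parsed = datetime.strptime(f"{day} {month} {year}", fmt)
        except ValueError:
            continue
        return parsed.date().isoformat()
    return None
```

Anything that does not parse comes back as `None`, which corresponds to the messier leftovers that still need manual regex work.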
## 2023-02-13
- Export CGSpace to do some metadata quality checks
- I added CGIAR Trust Fund as a donor to some new Initiative outputs
- I moved some abstracts from the description field
- I moved some version information to the `cg.edition` field
## 2023-02-14
- The PRMS team in Colombia sent some questions about countries on CGSpace
- I had to fix some that were clearly wrong, but there is also a difference between CGSpace and MEL because we mostly use iso-codes and MEL uses the UN M.49 list
- Then I re-ran the country code tagger from cgspace-java-helpers, forcing the update on all items in the Initiatives community
- Remove Alliance research levers from `cg.contributor.crp` field after discussing with Daniel and Maria
- This was a mistake on TIP's part, and there is no direct mapping between research levers and CRPs
- I exported CGSpace to check Initiative collection mappings, regions, and licenses
- Peter told me that all CGIAR blog posts for the Initiatives should be CC-BY-4.0, and I see the logo at the bottom in light gray!
- I had previously missed that and removed some licenses for blog posts
- I checked cgiar.org, ifpri.org, icarda.org, iwmi.cgiar.org, irri.org, etc and corrected a handful
- Start a harvest on AReS
## 2023-02-15
- Work on rebasing my local DSpace 7 dev branches on top of the latest 7.5-SNAPSHOT
- It seems the issues I had with the `dspace submission-forms-migrate` tool in [August, 2022]({{< relref "2022-08.md" >}}) were fixed
- I imported a fresh PostgreSQL snapshot from CGSpace, then removed the Atmire migrations and ran the new migrations as I originally noted in [March, 2022]({{< relref "2022-03.md" >}}) and as pointed out in the [DSpace 7 upgrade notes](https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace)
- Now I get a new error:
```console
localhost/dspace7= ☘ DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
localhost/dspace7= ☘ DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
localhost/dspace7= \q
$ ./bin/dspace database migrate ignored
...
CREATE INDEX resourcepolicy_action_idx ON resourcepolicy(action_id)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.handleException(DefaultSqlScriptExecutor.java:275)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:222)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.execute(DefaultSqlScriptExecutor.java:126)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.executeOnce(SqlMigrationExecutor.java:69)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.lambda$execute$0(SqlMigrationExecutor.java:58)
at org.flywaydb.core.internal.database.DefaultExecutionStrategy.execute(DefaultExecutionStrategy.java:27)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:57)
at org.flywaydb.core.internal.command.DbMigrate.doMigrateGroup(DbMigrate.java:377)
... 24 more
Caused by: org.postgresql.util.PSQLException: ERROR: relation "resourcepolicy_action_idx" already exists
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:333)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:319)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:295)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:290)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
at org.flywaydb.core.internal.jdbc.JdbcTemplate.executeStatement(JdbcTemplate.java:201)
at org.flywaydb.core.internal.sqlscript.ParsedSqlStatement.execute(ParsedSqlStatement.java:95)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:210)
... 30 more
```
- I dropped that index and then the migration succeeded:
```console
localhost/dspace7= ☘ DROP INDEX resourcepolicy_action_idx;
localhost/dspace7= ☘ \q
$ ./bin/dspace database migrate ignored
Done.
```
- I think that particular error is because I applied the [indexes in this unmerged DSpace 6 patch](https://github.com/DSpace/DSpace/pull/1792), so I don't need to report this as an error in DSpace 7
## 2023-02-16
- I found a suspicious number of PostgreSQL locks on CGSpace and decided to investigate:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
44 dspaceApi
372 dspaceCli
446 dspaceWeb
```
- This started happening yesterday and I killed a few locks that were several hours old after inspecting the `locks-age.sql` output
- I also checked the `locks.sql` output, which helpfully lists the blocked PID and the blocking PID, to find one blocking PID that was idle in transaction
- I killed that process and then all other locks were instantly processed
- I filed [a GitHub issue](https://github.com/DSpace/dspace-angular/issues/2103) on dspace-angular requesting the item view to use the bitstream description instead of the file name if present
- Weekly CG Core types meeting
- I need to go through the actions and remove those items that are only for CGSpace internal use, ie:
- CD-ROM
- Manuscript-unpublished
- Photo Report
- Questionnaire
- Wiki
- Weekly CGIAR Repository Working Group meeting
- I did some experiments with Crossref dates for about 20,000 DOIs in CGSpace using my `crossref-doi-lookup.py` script
- Some things I noted from reading the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and inspecting the records for a few dozen DOIs manually:
- `["created"]["date-parts"]` → Date on which the DOI was first registered (not useful for us)
- `["published-print"]["date-parts"]` → Date on which the work was published in print
- `["journal-issue"]["published-print"]["date-parts"]` → When present, is 99% the same as the above
- `["published-online"]["date-parts"]` → Date on which the work was published online
- `["journal-issue"]["published-online"]["date-parts"]` → Much more rare, and only 50% the same as the above, so unreliable
- `["issued"]["date-parts"]` → Earliest of published-print and published-online (not useful to us)
- After checking the DOIs manually I decided that when the `published-print` date exists, it is usually more accurate than our issued dates
- I set 12,300 issue dates to those from Crossref
- I also decided that, when `published-online` exists, it is usually accurate when I check the publisher page (we don't have many online dates to compare)
- I set the available date for ~7,000 items to the published-online date as long as:
- There was no `dcterms.available` date already
- It was different than the issued date, because for now I only want online dates that are different, in case this is an online only journal in which case that can be the issue date... maybe I'll re-visit that later
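The date rules above can be sketched in Python roughly as follows. This is a reading of the notes, not the script that was run: `date_parts_to_iso` and `pick_dates` are hypothetical names, the field names come from the Crossref API format, and comparing the online date against the resulting (possibly updated) issue date is an assumption.

```python
def date_parts_to_iso(field):
    """Convert a Crossref date object to YYYY[-MM[-DD]], or None if empty."""
    parts = field["date-parts"][0]
    if not parts or parts[0] is None:
        return None
    # Pad the year to four digits and month/day to two
    return "-".join(f"{part:0{width}d}" for part, width in zip(parts, (4, 2, 2)))


def pick_dates(work, current_issued, current_available=None):
    """Return (issued, available) after applying the rules in this entry."""
    print_date = None
    online_date = None
    if "published-print" in work:
        print_date = date_parts_to_iso(work["published-print"])
    if "published-online" in work:
        online_date = date_parts_to_iso(work["published-online"])

    # Prefer the print date over the existing issue date when it exists
    issued = print_date or current_issued

    # Only set an online date if none exists and it differs from the issue date
    available = current_available
    if online_date and not current_available and online_date != issued:
        available = online_date
    return issued, available
```

A record with only `published-online` equal to the existing issue date is left untouched, which matches the online-only journal case.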
## 2023-02-17
- It seems some (all?) of the changes I applied to dates last night didn't get saved...
- I don't know what happened, so I will run them again after some investigation
- I submitted the first batch of ~7,600 changes and it took twelve hours!
- I almost cancelled it because after applying the changes there was a lock blocking everything for two hours, and it seemed to be stuck, but I kept checking it and saw that the `query_start` and `state_change` were being updated despite it being state "idle in transaction":
```console
$ psql -c 'SELECT * FROM pg_stat_activity WHERE pid=1025176' | less -S
```
- I will apply the other changes in smaller batches...
- Lately I've noticed a lot of activity from the country code tagger curation task
- Looking in the logs I see items being tagged that are very old and should have already been tagged years ago
- Also, I see a ton of these errors whenever the task is updating an item:
```console
2023-02-17 08:01:00,252 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/89020 with status: 0. Result: '10568/89020: added 1 alpha2 country code(s)'
2023-02-17 08:01:00,467 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: a0fe9d9a-6ac1-4b6a-8fcb-dae07a6bbf58 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.curate.Curator.visit(Curator.java:541)
at org.dspace.curate.Curator$TaskRunner.run(Curator.java:568)
at org.dspace.curate.Curator.doCollection(Curator.java:515)
at org.dspace.curate.Curator.doCommunity(Curator.java:487)
at org.dspace.curate.Curator.doSite(Curator.java:451)
at org.dspace.curate.Curator.curate(Curator.java:269)
at org.dspace.curate.Curator.curate(Curator.java:203)
at org.dspace.curate.CurationCli.main(CurationCli.java:220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- This must be related...
## 2023-02-18
- I realized why the country-code-tagger was tagging everything: I had overridden the `force` parameter last week!
- Start a harvest on AReS
## 2023-02-20
- IWMI is concerned that some of their items with top Altmetric attention scores don't show up in the AReS Explorer
- I looked into it for one and found that AReS is using the Handle, but Altmetric hasn't associated the Handle with the DOI
- Looking into country and region issues for the PRMS team
- Last week they had some questions about some invalid countries that ended up being typos
- I realized my cgspace-java-helpers country-code-tagger curation task is not using the latest version, so it was missing Türkiye
- I compiled the new version and ran it manually, but I have to upload a new version to Maven Central and then update the dependency in `dspace/modules/additions/pom.xml` ughhhhhh
- I tagged version 6.2 with the change for Türkiye and uploaded it to Maven Central with `mvn clean deploy`
- I'm having second thoughts about switching to UN M.49 for countries because there are just too many tradeoffs
- I want to find a way to keep our existing list, and codify some rules for it
- There are several discussions related to the shortcomings of ISO themselves and the iso-codes project, for example:
- [Inconsistency with articles in ISO-3166-1 English short names](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) (this one was filed by me two years ago!)
- [ISO 3166-1: What's the policy for `common_name`?](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/44)
- I almost want to say fuck it, let's just use iso-codes and tell everyone to deal with it, but make sure we handle ISO 3166-1 Alpha2 or probably Alpha3 in the future
- Something like:
- Prefer `common_name` if it exists
- Prefer the shorter of `name` and `official_name`
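A minimal sketch of that rule in Python, assuming entries shaped like the iso-codes ISO 3166-1 JSON (keys `name`, `official_name`, and optionally `common_name`). `preferred_name` is a hypothetical helper, and the two sample entries are modelled on that layout rather than copied from a specific iso-codes release.

```python
def preferred_name(entry):
    """Pick a display name: common_name first, else the shorter of the rest."""
    if entry.get("common_name"):
        return entry["common_name"]
    # Assumes at least one of the two keys is present; on a tie, name wins
    candidates = [entry[key] for key in ("name", "official_name") if entry.get(key)]
    return min(candidates, key=len)


# Illustrative entries only
bolivia = {
    "alpha_2": "BO",
    "name": "Bolivia, Plurinational State of",
    "official_name": "Plurinational State of Bolivia",
    "common_name": "Bolivia",
}
kenya = {"alpha_2": "KE", "name": "Kenya", "official_name": "Republic of Kenya"}

print(preferred_name(bolivia), preferred_name(kenya))
```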
## 2023-02-21
- Continue working on my `parse-iso-codes.py` script to parse the iso-codes JSON for ISO 3166-1
- I also started a spreadsheet to track current CGSpace country names, proposed new names using the compromise above, and UN M.49 names
- I proposed this to Peter but he wasn't happy because there are still some stupidly long and political names there
- I bumped the version of cgspace-java-helpers to 6.2-SNAPSHOT and pushed it to Maven Central because I can't figure out how to get non-snapshot releases to go there
- Ouch, grunt 1.6.0 was released a few weeks ago, which relies on Node.js v16, thus breaking the Mirage 2 build in DSpace 6
- I filed [an issue in DSpace](https://github.com/DSpace/DSpace/issues/8676)
- Help Moises from CIP troubleshoot harvesting issues on their WordPress site
- I see 2,000 requests with the user agent "RTB website BOT" today and they are all HTTP 200:
```console
# grep 'RTB website BOT' /var/log/nginx/rest.log | awk '{print $9}' | sort | uniq -c | sort -h
2023 200
```
- Start reviewing and fixing metadata for Sam's ~250 CAS publications from last year
- Both Abenet and Peter have already looked at them and Sam has been waiting for months on this
## 2023-02-22
- Continue proofing CAS records for Sam
- I downloaded all the PDFs manually and checked the issue dates for each from the PDF, noting some that had licenses, ISBNs, etc
- I combined the title, abstract, and system subjects into one column to mine them for AGROVOC terms:
```console
toLowercase(value) + toLowercase(cells["dcterms.abstract"].value) + toLowercase(cells["cg.subject.system"].value.replace("||", " "))
```
- Then I extracted a list of AGROVOC terms the same way I did in [August, 2022]({{< relref "2022-08.md" >}}) and used this Jython code to extract matching terms:
```python
import re
with open(r"/tmp/agrovoc-subjects.txt", "r") as f:
    terms = [name.rstrip().lower() for name in f]
return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
- Then I used [this cool Jython to remove duplicate metadata values](https://stackoverflow.com/questions/15419080/openrefine-remove-duplicates-from-list-with-jython):
```python
deduped_list = list(set(value.split("||")))
return '||'.join(map(str, deduped_list))
```
- Then I did the same with countries, woooooo!
- I checked for duplicates and found forty-one
- I just stumbled upon UNTERM, which provides the official list of countries for the UN General Assembly, including a downloadable Excel with the short and formal names in all UN languages: https://unterm.un.org/unterm2/en/country
- I created a [pull request to add common names for Iran, Laos, and Syria on the Debian iso-codes package](https://salsa.debian.org/iso-codes-team/iso-codes/-/merge_requests/32)
- These are remarked upon in the ISO.org online browsing platform for ISO 3166-1
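For use outside OpenRefine, the term matching and de-duplication steps earlier in this entry could be combined into one plain Python 3 function. This is a sketch with a hypothetical name (`match_terms`), and it deviates from the Jython in two labelled ways: terms are escaped so that regex metacharacters are matched literally, and duplicates are removed in first-seen order instead of via `set()`.

```python
import re


def match_terms(text, terms):
    """Return the vocabulary terms that occur as whole words in text."""
    text = text.lower()
    matched = []
    for term in terms:
        term = term.strip().lower()
        if not term:
            continue
        # Lookarounds act as word boundaries even if a term ends in punctuation
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text):
            matched.append(term)
    # dict.fromkeys() drops duplicates while keeping the original order
    return "||".join(dict.fromkeys(matched))
```

Keeping the order stable makes the resulting multi-value cell easier to diff between runs.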
## 2023-02-23
- Tag v0.6.1 of csv-metadata-quality
- Weekly meeting about CG Core types
- I need to get some definitions from Peter for some types
- Peter sent some of the feedback from Indira to XMLUI
- I removed some old facets, limited others to fewer values, and adjusted the recent submissions from 5 to 10
## 2023-02-24
- More work on understanding Sam's CAS publications to prepare for uploading them to CGSpace
- I need to reconcile the duplicates and Peter's type re-classifications in the final version of the spreadsheet
- I flagged all the duplicates by creating a custom text facet matching all their titles like:
```console
or(
isNotNull(value.match("Evaluation of the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS)")),
isNotNull(value.match("Report of the IEA Workshop on Development, Use and Assessment of TOC in CGIAR Research, Rome, 12-13 January 2017")),
isNotNull(value.match("Report of the IEA Workshop on Evaluating the Quality of Science, Rome, 10-11 December 2015")),
isNotNull(value.match("Review of CGIARs Intellectual Assets Principles")),
...
)
```
- Annoyingly this seems to miss the ones with parentheses so I had to do those manually
- This matched thirty-seven items, then I flagged them so I can handle them separately after uploading the others
- Then I used the URL field in the old version of the file to match the items with types `Evaluation` and `Independent Commentary` since Peter changed them
- I added extent, volume, issue, number, and affiliation to a few journal articles
- Then I did some last minute checks to make sure we're not uploading files for items marked as having "multiple documents"
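A likely reason the titles with parentheses were missed in the facet earlier in this entry is that the title is being used as a regular expression, so `(CCAFS)` is parsed as a group rather than as literal parentheses. The Python sketch below shows the same effect with the `re` module; it assumes GREL's `match()` behaves comparably and `is_flagged` is a hypothetical helper, not the facet expression itself.

```python
import re

title = (
    "Evaluation of the CGIAR Research Program on Climate Change, "
    "Agriculture and Food Security (CCAFS)"
)

# Used as a pattern, "(CCAFS)" is a capture group that matches "CCAFS"
# without parentheses, so the literal title does not match itself
unescaped = re.fullmatch(title, title)

# Escaping the title makes every character literal
escaped = re.fullmatch(re.escape(title), title)


def is_flagged(value, duplicate_titles):
    """True if value is exactly one of the known duplicate titles."""
    return any(re.fullmatch(re.escape(t), value) for t in duplicate_titles)


print(unescaped is None, escaped is not None)
```

For exact title matching a plain string comparison would do the same job with no regex involved.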
## 2023-02-25
- Oh nice, my [pull request adding common names for Iran, Laos, and Syria to iso-codes](https://salsa.debian.org/iso-codes-team/iso-codes/-/merge_requests/32) was merged
- I did a test import of the 198 CAS Publications on DSpace Test, then inspected Abenet's file with Gaia's "multiple documents" field one more time and decided to do the import on CGSpace
- Gaia's "multiple documents" column had some text like "E6" and "F7" that didn't make any sense, and those files were not even in the SharePoint
## 2023-02-26
- Start a harvest on AReS
## 2023-02-27
- I found two items for the CAS Publications that were marked as duplicates, but upon second inspection were not, so I uploaded them to CGSpace
- That makes the total number of items for CAS 200...
- I joined the remaining thirty-six duplicates with the metadata for their existing items on CGSpace, inspected them, and uploaded them
- Do some work on the new DSpace 7 submission forms
- I ended up reverting to the stock configuration to use some new techniques like the style and type bind
## 2023-02-28
- Keep working on the DSpace 7 submission forms
- As part of this I asked Maria and Francesca if they are still using the `cg.link.permalink` (Bioversity publications permalink) and they said no, so we can remove it from the submission form
- I also removed `cg.subject.ccafs` since the CRP ended over a year ago and `cg.subject.pabra` since there have only been a handful of new items in [their collection](https://hdl.handle.net/10568/80211) and they seem to be using Alliance subjects instead
- I filed [a bug](https://github.com/DSpace/DSpace/issues/8686) on DSpace regarding the inability to add freetext values from an input field that uses a vocabulary
<!-- vim: set sw=2 ts=2: -->

655
content/posts/2023-03.md Normal file

@@ -0,0 +1,655 @@
---
title: "March, 2023"
date: 2023-03-01T07:58:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-03-01
- Remove `cg.subject.wle` and `cg.identifier.wletheme` from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
- [iso-codes 4.13.0 was released](https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/CHANGELOG.md#4130-2023-02-28), which incorporates my changes to the common names for Iran, Laos, and Syria
- I finally got through with porting the input form from DSpace 6 to DSpace 7
<!--more-->
- I can't put my finger on it, but the input form has to be formatted very particularly: for example, if your rows have more than two fields in them without a sufficient Bootstrap grid style, or if you use a `twobox`, etc, the entire form step appears blank
## 2023-03-02
- I did some experiments with the new [Pandas 2.0.0rc0 Apache Arrow support](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)
- There is a change to the way nulls are handled and it causes my tests for `pd.isna(field)` to fail
- I think we need to consider blanks as null, but I'm not sure
- I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
- I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in `discovery.xml` since we are no longer using them in sidebar facets
## 2023-03-03
- Atmire merged one of my old pull requests into COUNTER-Robots:
- [COUNTER_Robots_list.json: Add new bots](https://github.com/atmire/COUNTER-Robots/pull/54)
- I will update the local ILRI overrides in our DSpace spider agents file
## 2023-03-04
- Submit a [pull request on pycountry to use iso-codes 4.13.0](https://github.com/flyingcircusio/pycountry/pull/156)
## 2023-03-05
- Start a harvest on AReS
## 2023-03-06
- Export CGSpace to do Initiative collection mappings
- There were thirty-three that needed updating
- Send Abenet and Sam a list of twenty-one CAS publications that had been marked as "multiple documents" that we uploaded as metadata-only items
- Goshu will download the PDFs for each and upload them to the items on CGSpace manually
- I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
- It seems there is a problem recognizing empty strings as na with `pd.isna()`
- If I do `pd.isna(field) or field == ""` then it works as expected, but that feels hacky
- I'm going to test again on the next release...
- Note that I had been setting both of these global options:
```
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True
```
- Then reading the CSV like this:
```
df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
```
## 2023-03-07
- Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:
```console
$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
```
- Peter sent me a list of items that had ILRI affiliation on Altmetric, but that didn't have Handles
- I ran a duplicate check on them to find if they exist or if we can import them
- There were about ninety matches, but a few dozen of those were pre-prints!
- After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
- After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
- Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
- For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script
- After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
- We can infer it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
- I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA...
- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
- An unscientific comparison of duplicate checking Peter's file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
- PostgreSQL 12: `0.11s user 0.04s system 0% cpu 19:24.65 total`
- PostgreSQL 14: `0.12s user 0.04s system 0% cpu 18:13.47 total`
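To put a number on that comparison, the elapsed times can be converted to seconds. A small sketch, assuming the lines are in the zsh `time` format shown above, with `total_seconds` as a hypothetical helper:

```python
def total_seconds(time_line):
    """Parse the elapsed "[H:]MM:SS.ss total" field of a zsh-style time line."""
    elapsed = time_line.split()[-2]  # the field just before "total"
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


pg12 = total_seconds("0.11s user 0.04s system 0% cpu 19:24.65 total")
pg14 = total_seconds("0.12s user 0.04s system 0% cpu 18:13.47 total")
saved = pg12 - pg14
print(f"PostgreSQL 14 was {saved:.2f}s ({saved / pg12:.1%}) faster")
```

That works out to a bit over a minute, or roughly 6%, which is small enough that a single unscientific run should not be over-interpreted.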
## 2023-03-08
- I am wondering how to speed up PostgreSQL trgm searches more
- I see my local PostgreSQL is using vanilla configuration and I should update some configs:
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
setting │ unit
─────────┼──────
16384 │ 8kB
(1 row)
```
- I re-created my PostgreSQL 14 container with some extra memory settings:
```console
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```
- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
```
- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
- On a hunch, I tried with a GIN index:
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
```
- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
- So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which will for sure increase the size taken):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
```
- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
setting │ unit
─────────┼──────
4096 │ kB
(1 row)
```
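- For reference, the similarity measure that these trigram indexes accelerate can be sketched in a few lines of Python. This is only an approximation based on my reading of the `pg_trgm` documentation (lowercase the string, split it into words, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams across both strings). The function names `trigrams` and `similarity` are hypothetical, and the sketch only handles ASCII letters and digits:

```python
import re


def trigrams(text):
    """Return the set of trigrams for a string, roughly as pg_trgm builds them."""
    grams = set()
    # Assumption: words are runs of ASCII letters and digits
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in both strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(similarity("word", "two words"))
```

- Short titles produce few trigrams, so a single differing word moves the score a lot, which is worth keeping in mind when picking a similarity threshold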
- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
- I used this to fetch the open access status for each DOI from Unpaywall
- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:
```python
unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email
return request_url
```
- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
- Then you get a JSON blob in each and you can extract the Open Access status with a GREL like `value.parseJson()['is_oa']`
- I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones
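- The same two steps (build the request URL, then read `is_oa` from the JSON) can be sketched outside OpenRefine in plain Python. This is a minimal sketch, not what I ran: the function names are hypothetical, the email address is a placeholder, and the URL-encoding of the email is my addition. It does not make any network request:

```python
import json
from urllib.parse import quote

UNPAYWALL_BASEURL = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi_url, email):
    """Build the Unpaywall request URL for a DOI given as https://doi.org/..."""
    doi = doi_url.replace("https://doi.org/", "")
    # Assumption: encode the email so characters like "+" survive the query string
    return UNPAYWALL_BASEURL + quote(doi, safe="/") + "?email=" + quote(email)


def is_open_access(response_text):
    """Extract the is_oa flag from an Unpaywall JSON response body."""
    record = json.loads(response_text)
    return record.get("is_oa")


print(unpaywall_url("https://doi.org/10.1017/s0031182013001625", "user@example.org"))
```

- Given the false positives mentioned above, a caller would probably only act on the result when `is_open_access()` returns `False`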
- I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
- The syntax was hairy because it's marked up with tags like `<jats:p>`, but this got me most of the way there:
```console
value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
value.replace("<jats:italic>","").replace("</jats:italic>", "")
value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
```
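- A rough Python equivalent of the GREL clean up above, as a sketch only. It assumes the abstract is a string containing `jats:`-prefixed tags and simply strips every such tag with a regular expression, which is cruder than the GREL version (that one selects the first `jats:p` element). The name `strip_jats` is hypothetical:

```python
import re

# Matches opening and closing tags such as <jats:p>, </jats:italic>, <jats:sub>
JATS_TAG = re.compile(r"</?jats:[^>]+>")


def strip_jats(abstract):
    """Remove JATS markup from a Crossref abstract and tidy the whitespace."""
    text = JATS_TAG.sub("", abstract)
    return re.sub(r"\s+", " ", text).strip()


sample = "<jats:p>Effects of <jats:italic>Zea mays</jats:italic> on CO<jats:sub>2</jats:sub></jats:p>"
print(strip_jats(sample))
```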
- I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
- I exported a list of authors, affiliations, and funders from the new items to let Peter correct them:
```console
$ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -nr | awk '{$1=""; print $0}' | sed -e 's/^ //' > /tmp/new-authors.csv
```
- Meeting with FAO AGRIS team about how to detect duplicates
- They are currently using a sha256 hash on titles, which will work, but will only return exact matches
- I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches
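- A minimal sketch of what I meant by normalizing before hashing, assuming English titles. The stop word list is a tiny hypothetical example and the function names are made up for illustration, so this is not what AGRIS actually runs:

```python
import hashlib
import re
import unicodedata

# Hypothetical, deliberately short stop word list
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and drop stop words."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def title_hash(title):
    """SHA-256 hex digest of the normalized title."""
    return hashlib.sha256(normalize_title(title).encode("utf-8")).hexdigest()


print(normalize_title("The Effects of Drought on Maize in Kénya"))
```

- Titles that differ only in case, accents, punctuation, or stop words then hash to the same value, though this still won't catch typos or reworded titles the way a trigram similarity check does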
- Meeting with Abenet to discuss CGSpace issues
- She reminded me about needing a metadata field for first author when the affiliation is ILRI
- I said I prefer to write a small script for her that will check the first author and first affiliation... I could do it easily in Python, but would need to put a web frontend on it for her
- Unless we could do that in AReS reports somehow
## 2023-03-09
- Apply a bunch of corrections to authors, affiliations, and donors on the new items on DSpace Test
- Meeting with Peter and Abenet about future OpenRXV developments, DSpace 7, etc
- I submitted an [issue on MEL asking them to add provenance metadata when submitting to CGSpace](https://github.com/CodeObia/MEL/issues/11173)
## 2023-03-10
- CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px
- Our thumbnails are 300px so they get up-scaled and look bad
- I realized that the last time we [increased the size of our thumbnails was in 2013](https://github.com/ilri/DSpace/commit/5de61e220124c1d0441c87cd7d36d18cb2293c03), from 94x130 to 300px
- I offered to CKM that we increase them again to 400 or 600px
- I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on [this item](https://hdl.handle.net/10568/126388):
```console
$ ls -lh 10568-126388-*
-rw-r--r-- 1 aorth aorth 31K Mar 10 12:42 10568-126388-300px.jpg
-rw-r--r-- 1 aorth aorth 52K Mar 10 12:41 10568-126388-400px.jpg
-rw-r--r-- 1 aorth aorth 76K Mar 10 12:43 10568-126388-500px.jpg
-rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg
```
- Seems like 600px is 3 to 4 times the file size of 300px, so maybe we should shoot for 400px or 500px
- I decided on 500px
- I started re-generating new thumbnails for the ILRI Publications, CGIAR Initiatives, and other collections
- On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px)
- And now that I'm looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF thumbnails
- Peter sent me citations and ILRI subjects for the 350 new ILRI publications
- I guess he edited it in Excel because there are a bunch of encoding issues with accents
- I merged Peter's citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace
- In the end it was only 348 items for some reason...
## 2023-03-12
- Start a harvest on AReS
## 2023-03-13
- Extract a list of DOIs from the Creative Commons licensed ILRI journal articles that I uploaded last week, skipping any that are "no derivatives" (ND):
```console
$ csvgrep -c 'dc.description.provenance[en]' -m 'Made available in DSpace on 2023-03-10' /tmp/ilri-articles.csv \
| csvgrep -c 'dcterms.license[en_US]' -r 'CC(0|\-BY)' \
| csvgrep -c 'dcterms.license[en_US]' -i -r '\-ND\-' \
| csvcut -c 'id,cg.identifier.doi[en_US],dcterms.type[en_US]' > 2023-03-13-journal-articles.csv
```
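- The license part of that filter can also be sketched in Python. This is a sketch that assumes the rows come from `csv.DictReader` with the same `dcterms.license[en_US]` column and reuses the two regular expressions from the `csvgrep` calls. It leaves out the provenance filter, and the function names are hypothetical:

```python
import re

LICENSE_COLUMN = "dcterms.license[en_US]"
CC_LICENSE = re.compile(r"CC(0|-BY)")   # CC0 or any CC-BY variant
NO_DERIVATIVES = re.compile(r"-ND-")    # "no derivatives" variants


def is_reusable(license_value):
    """True for CC0 or CC-BY licenses that are not "no derivatives"."""
    if not CC_LICENSE.search(license_value):
        return False
    return NO_DERIVATIVES.search(license_value) is None


def reusable_rows(rows):
    """Keep only the rows (dicts) whose license passes is_reusable()."""
    return [row for row in rows if is_reusable(row.get(LICENSE_COLUMN) or "")]


rows = [
    {"id": "1", LICENSE_COLUMN: "CC-BY-4.0"},
    {"id": "2", LICENSE_COLUMN: "CC-BY-NC-ND-4.0"},
]
print([row["id"] for row in reusable_rows(rows)])
```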
- I want to write a script to download the PDFs and create thumbnails for them, then upload to CGSpace
- I wrote one based on `post_ciat_pdfs.py` but it seems there is an issue uploading anything other than a PDF
- When I upload a JPG or a PNG the file begins with:
```console
Content-Disposition: form-data; name="file"; filename="10.1017-s0031182013001625.pdf.jpg"
```
- ... this means it is invalid...
- I tried in both the `ORIGINAL` and `THUMBNAIL` bundle, and with different filenames
- I tried manually on the command line with `http` and both PDF and PNG work... hmmmm
- Hmm, this seems to have been due to some difference in behavior between the `files` and `data` parameters of `requests.post()`
- I finalized the `post_bitstreams.py` script and uploaded eighty-five PDF thumbnails
- It seems Bizu uploaded covers for a handful so I deleted them and ran them through the script to get proper thumbnails
## 2023-03-14
- Add twelve IFPRI authors to our controlled vocabulary for authors and ORCID identifiers
- I also tagged their existing items on CGSpace
- Export all our ORCIDs and resolve their names to see if any have changed:
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-03-14-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-03-14-orcids.txt -o /tmp/2023-03-14-orcids-names.txt -d
```
- Then update them in the database:
```console
$ ./ilri/update_orcids.py -i /tmp/2023-03-14-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
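- The `grep | sort -u` extraction can be sketched in Python as well, with one addition of mine that the shell version does not do: checking the ORCID check digit (ISO 7064 MOD 11-2, as described in ORCID's documentation) to catch typos in the vocabulary. The function names are hypothetical:

```python
import re

ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the sorted, de-duplicated ORCID-like strings found in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the identifier has 16 characters and a correct check digit."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_digit(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0002-1825-0097"))
```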
## 2023-03-15
- Jawoo was asking about possibilities to harvest PDFs from CGSpace for some kind of AI chatbot integration
- I see we have 45,000 PDFs (format ID 2)
```console
localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
count
───────
45281
(1 row)
```
- Rework some of my Python scripts to use a common `db_connect` function from util
- I reworked my `post_bitstreams.py` script to be able to overwrite bitstreams if requested
- The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers
- I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script
## 2023-03-16
- Continue working on the ILRI publication thumbnails
- There were about sixty-four that had existing PNG "journal cover" thumbnails that didn't get replaced because I only overwrote the JPEG ones yesterday
- Now I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API
- I made a [pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF](https://github.com/DSpace/DSpace/pull/8722)
- Export CGSpace to perform mappings to Initiatives collections
- I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely submitted a low-quality "journal cover" for the item
- I found about 330 of them and got most of their PDFs from Sci-Hub and replaced the crappy thumbnails with real ones where Sci-Hub had them (~245)
- In related news, I realized you can get an [API key from Elsevier and download the PDFs from their API](https://stackoverflow.com/questions/59202176/python-download-papers-from-sciencedirect-by-doi-with-requests):
```python
import requests
api_key = 'fuuuuuuuuu'
doi = "10.1016/j.foodqual.2021.104362"
request_url = f'https://api.elsevier.com/content/article/doi:{doi}'
headers = {
    'X-ELS-APIKEY': api_key,
    'Accept': 'application/pdf'
}
with requests.get(request_url, stream=True, headers=headers) as r:
    if r.status_code == 200:
        with open("article.pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
```
- The question is, how do we know if a DOI is Elsevier or not...
- CGIAR Repositories Working Group meeting
- We discussed controlled vocabularies for funders
- I suggested checking our combined lists against Crossref and ROR
- Export a list of donors from `cg.contributor.donor` on CGSpace:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
COPY 1521
```
- Then resolve them against Crossref's funders API:
```console
$ ./ilri/crossref_funders_lookup.py -e fuuuu@cgiar.org -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv -d
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
472
$ sed 1d ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
1521
```
- That's a 31% hit rate, but I see some simple things like "Bill and Melinda Gates Foundation" instead of "Bill & Melinda Gates Foundation"
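- A minimal sketch of the kind of normalization that would make those two spellings compare equal before matching. The function name is hypothetical, and I have not checked whether it would actually improve the hit rate against Crossref:

```python
import re


def normalize_funder(name):
    """Casefold, spell out ampersands, drop punctuation, collapse whitespace."""
    text = name.casefold().replace("&", " and ")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


print(normalize_funder("Bill & Melinda Gates Foundation"))
```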
## 2023-03-17
- I did the same lookup of CGSpace donors on ROR's 2022-12-01 data dump:
```console
$ ./ilri/ror_lookup.py -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
407
$ sed 1d ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
1521
```
- That's a 26.7% hit rate
- As for the number of funders in each dataset
- Crossref has about 34,000
- ROR has 15,000 if "FundRef" data is a proxy for that:
```console
$ grep -c -rsI FundRef v1.15-2022-12-01-ror-data.json
15162
```
- On a related note, I remembered that Crossref has a list of DOI prefixes and publishers: https://doi.crossref.org/getPrefixPublisher
- In Python I can look up publishers by prefix easily, here with a list comprehension:
```console
In [10]: [publisher for publisher in publishers if '10.3390' in publisher['prefixes']]
Out[10]:
[{'prefixes': ['10.1989', '10.32545', '10.20944', '10.3390', '10.35995'],
'name': 'MDPI AG',
'memberId': 1968}]
```
- And in OpenRefine, if I create a new column based on the DOI using Jython:
```python
import json
with open("/home/aorth/src/git/DSpace/publisher-doi-prefixes.json", "rb") as f:
    publishers = json.load(f)
doi_prefix = value.split("/")[3]
publisher = [publisher for publisher in publishers if doi_prefix in publisher['prefixes']]
return publisher[0]['name']
```
- ... though this is very slow and hung OpenRefine when I tried it
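- The slowness is probably because the expression re-reads the JSON file and scans the whole publisher list for every cell. A sketch of an alternative, assuming the same list-of-dicts structure shown above: build a prefix-to-name dictionary once, then do constant-time lookups. The function names are hypothetical and the sample data is just the MDPI entry from above:

```python
def build_prefix_index(publishers):
    """Map each DOI prefix to its publisher name (one pass over the list)."""
    index = {}
    for publisher in publishers:
        for prefix in publisher["prefixes"]:
            index[prefix] = publisher["name"]
    return index


def publisher_for_doi(doi_url, index):
    """Look up the publisher for a DOI URL, or None if the prefix is unknown."""
    doi = doi_url.replace("https://doi.org/", "")
    prefix = doi.split("/")[0]
    return index.get(prefix)


publishers = [
    {
        "prefixes": ["10.1989", "10.32545", "10.20944", "10.3390", "10.35995"],
        "name": "MDPI AG",
        "memberId": 1968,
    }
]
index = build_prefix_index(publishers)
print(publisher_for_doi("https://doi.org/10.3390/su12010001", index))
```

- The same index would also answer the earlier question of whether a DOI belongs to a given publisher, as long as the prefix list is current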
- I added the ability to overwrite multiple bitstream formats at once in `post_bitstreams.py`
```console
$ ./ilri/post_bitstreams.py -i test.csv -u https://dspacetest.cgiar.org/rest -e fuuu@example.com -p 'fffnjnjn' -d -s 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 --overwrite JPEG --overwrite PNG -n
Session valid: 2B40C7C4E34CEFCF5AFAE4B75A8C52E2
Opened test.csv
384142cb-58b9-4e64-bcdc-0a8cc34888b3: checking for existing bitstreams in THUMBNAIL bundle
> (DRY RUN) Deleting bitstream: IFPRI Malawi_Maize Market Report_February_202_anonymous.pdf.jpg (16883cb0-1fc8-4786-a04f-32132e0617d4)
> (DRY RUN) Deleting bitstream: AgroEcol_Newsletter_2.png (7e9cd434-45a6-4d55-8d56-4efa89d73813)
> (DRY RUN) Uploading file: 10568-129666.pdf.jpg
```
- I learned how to use Python's built-in `logging` module and it simplifies all my debug and info printing
- I re-factored a few scripts to use the new logging
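- A minimal sketch of the pattern, assuming a script with a debug flag like the `-d` option used above. The helper name `get_logger` and the message format are hypothetical, not what is in my scripts:

```python
import logging


def get_logger(name, debug=False):
    """Return a logger that prints DEBUG messages only when debug is set."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    # Only attach a handler once, even if this is called repeatedly
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger


logger = get_logger("post_bitstreams_example", debug=True)
logger.debug("Session valid")
logger.info("Opened test.csv")
```

- This replaces scattered `if debug: print(...)` checks with `logger.debug(...)` calls that are filtered by level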
## 2023-03-18
- I applied changes for publishers on 16,000 items in batches of 5,000
- While working on my `post_bitstreams.py` script I realized the Tomcat Crawler Session Manager valve that groups bot user agents into sessions is causing my login to fail the first time, every time
- I've disabled it for now and will check the Munin session graphs after some time to see if it makes a difference
- In any case I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve
## 2023-03-19
- Start a harvest on AReS
## 2023-03-20
- Minor updates to a few of my DSpace Python scripts to fix the logging
- Minor updates to some records for Mazingira reported by Sonja
- Upgrade PostgreSQL on DSpace Test from version 12 to 14, the same way I did from 10 to 12 last year:
- First, I installed the new version of PostgreSQL via the Ansible playbook scripts
- Then I stopped Tomcat and all PostgreSQL clusters and used `pg_upgradecluster` to upgrade the old version:
```console
# systemctl stop tomcat7
# pg_ctlcluster 12 main stop
# tar -cvzpf var-lib-postgresql-12.tar.gz /var/lib/postgresql/12
# tar -cvzpf etc-postgresql-12.tar.gz /etc/postgresql/12
# pg_ctlcluster 14 main stop
# pg_dropcluster 14 main
# pg_upgradecluster 12 main
# pg_ctlcluster 14 main start
```
- After that I [re-indexed the database indexes using a query](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/):
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- The index on `metadatavalue` shrunk by 90MB, and others a bit less
- This is nice, but not as drastic as I noticed last year when upgrading to PostgreSQL 12
## 2023-03-21
- Leigh sent me a list of IFPRI authors with ORCID identifiers so I combined them with our list and resolved all their names with `resolve_orcids.py`
- It adds 154 new ORCID identifiers
- I did a follow up to the publisher names from last week using the DOI prefix list from Crossref
- Last week I only updated items with a DOI that had *no* publisher, but now I was curious to see how our existing publisher information compared
- I checked a dozen or so manually and, other than CIFOR/ICRAF and CIAT/Alliance, the metadata was better than our existing data, so I overwrote them
- I spent some time trying to figure out how to get ssimulacra2 running so I could compare thumbnails in JPEG and WebP
- I realized that we can't directly compare JPEG to WebP, we need to convert to JPEG/WebP, then convert each to lossless PNG
- Also, we shouldn't be comparing the resulting images against each other, but rather the original, so I need a straight PDF to lossless PNG version also
- After playing with WebP at Q82 and Q92, I see it has lower ssimulacra2 scores than JPEG Q92 for the dozen test files
- Could it just be something with ImageMagick?
## 2023-03-22
- I updated csv-metadata-quality to use pandas 2.0.0rc1 and everything seems to work...?
- So the issues with nulls (isna) when I tried the first release candidate a few weeks ago were resolved?
- Meeting with Jawoo and others about a "ChatGPT-like" thing for CGIAR data using CGSpace documents and metadata
## 2023-03-23
- Add a missing IFPRI ORCID identifier to CGSpace and tag his items on CGSpace
- A super unscientific comparison between csv-metadata-quality's pytest regimen using Pandas 1.5.3 and Pandas 2.0.0rc1
- The data was gathered using [rusage](https://justine.lol/rusage), and these are the results of the last of three consecutive runs:
```
# Pandas 1.5.3
RL: took 1,585,999µs wall time
RL: ballooned to 272,380kb in size
RL: needed 2,093,947µs cpu (25% kernel)
RL: caused 55,856 page faults (100% memcpy)
RL: 699 context switches (1% consensual)
RL: performed 0 reads and 16 write i/o operations
# Pandas 2.0.0rc1
RL: took 1,625,718µs wall time
RL: ballooned to 262,116kb in size
RL: needed 2,148,425µs cpu (24% kernel)
RL: caused 63,934 page faults (100% memcpy)
RL: 461 context switches (2% consensual)
RL: performed 0 reads and 16 write i/o operations
```
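- So it seems that Pandas 2.0.0rc1 peaked at about ten megabytes less RAM (262,116kb versus 272,380kb) but took slightly more wall time (1.63 versus 1.59 seconds)... with a test suite this small either difference could be noise, so I can't attribute it to any particular change in Pandas 2.0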
- I should try to compare runs of larger input files
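- The relative differences between the two runs above work out as follows. This is just arithmetic on the numbers reported by rusage, and the dictionary keys are names I made up for this sketch:

```python
runs = {
    "1.5.3": {"wall_us": 1_585_999, "max_rss_kb": 272_380, "cpu_us": 2_093_947},
    "2.0.0rc1": {"wall_us": 1_625_718, "max_rss_kb": 262_116, "cpu_us": 2_148_425},
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


old, new = runs["1.5.3"], runs["2.0.0rc1"]
rss_diff_kb = new["max_rss_kb"] - old["max_rss_kb"]
for metric in ("wall_us", "max_rss_kb", "cpu_us"):
    print(f"{metric}: {percent_change(old[metric], new[metric]):+.1f}%")
```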
## 2023-03-24
- I added a Flyway SQL migration for the PNG bitstream format registry changes on DSpace 7.6
## 2023-03-26
- There seems to be a slightly high load on CGSpace
- I don't see any locks in PostgreSQL, but there's some new bot I have never heard of:
```console
92.119.18.13 - - [26/Mar/2023:18:41:47 +0200] "GET /handle/10568/16500/discover?filtertype_0=impactarea&filter_relational_operator_0=equals&filter_0=Climate+adaptation+and+mitigation&filtertype=sdg&filter_relational_operator=equals&filter=SDG+11+-+Sustainable+cities+and+communities HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"
```
- In the last week I see a handful of IPs making requests with this agent:
```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2,3,4,5,6,7}.gz | grep gocolly | awk '{print $1}' | sort | uniq -c | sort -h
2 194.233.95.37
4304 92.119.18.142
9496 5.180.208.152
27477 92.119.18.13
```
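- For comparison, the same count per IP can be sketched in Python with a `Counter`. This assumes the default nginx combined log format where the client IP is the first field, and takes any iterable of lines (for the rotated logs, `gzip.open(path, "rt")` would provide one). The function name is hypothetical and the sample lines are shortened:

```python
from collections import Counter


def count_ips(lines, agent="gocolly"):
    """Count requests per client IP for log lines mentioning the agent."""
    counts = Counter()
    for line in lines:
        if agent in line:
            # The client IP is the first space-separated field
            counts[line.split(" ", 1)[0]] += 1
    return counts


sample = [
    '92.119.18.13 - - [26/Mar/2023:18:41:47 +0200] "GET / HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"',
    '92.119.18.13 - - [26/Mar/2023:18:41:48 +0200] "GET / HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"',
    '10.0.0.1 - - [26/Mar/2023:18:41:49 +0200] "GET / HTTP/2.0" 200 100 "-" "Mozilla/5.0"',
]
for ip, hits in sorted(count_ips(sample).items(), key=lambda item: item[1]):
    print(hits, ip)
```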
- Most of these come from Packethub S.A. / ASN 62240 (CLOUVIDER Clouvider - Global ASN, GB)
- Oh, I've apparently seen this user agent before, as it is in our ILRI spider user agent overrides
- I exported CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-03-27
- The harvest on AReS was incredibly slow and I stopped it about halfway through, twelve hours later
- Then I relied on the plugins to get missing items, which caused a high load on the server but actually worked fine
- Continue working on thumbnails on DSpace
## 2023-03-28
- Regarding ImageMagick there are a few things I've learned
- The `-quality` setting does different things for different output formats, see: https://imagemagick.org/script/command-line-options.php#quality
- The `-compress` setting controls the compression algorithm for image data, and is unrelated to lossless/lossy
- On that note, `-compress lossless` for JPEGs refers to Lossless JPEG, which is not well defined or supported and should be avoided
- See: https://imagemagick.org/script/command-line-options.php#compress
- The way DSpace currently does its supersampling by exporting to a JPEG, then making a thumbnail of the JPEG, is a double lossy operation
- We should be exporting to something lossless like PNG, PPM, or MIFF, then making a thumbnail from that
- The PNG format is always lossless so the `-quality` setting controls compression and filtering, but has no effect on the appearance or signature of PNG images
- You can use `-quality n` with WebP's `-define webp:lossless=true`, but I'm not sure about the interaction between ImageMagick quality and WebP lossless...
- Also, if converting from a lossless format to WebP lossless in the same command, ImageMagick will ignore quality settings
- The MIFF format is useful for piping between ImageMagick commands, but it is also lossless and the quality setting is ignored
- You can use a format specifier when piping between ImageMagick commands without writing a file
- For example, I want to create a lossless PNG from a distorted JPEG for comparison:
```console
$ magick convert reference.jpg -quality 85 jpg:- | convert - distorted-lossless.png
```
- If I convert the JPEG to PNG directly it will ignore the quality setting, so I set the quality and the output format, then pipe it to ImageMagick again to convert to lossless PNG
- In an attempt to quantify the generation loss from DSpace's "JPG JPG" method of creating thumbnails I wrote a script called `generation-loss.sh` to test against a new "PNG JPG" method
- With my sample set of seventeen PDFs from CGSpace I found that _the "JPG JPG" method of thumbnailing results in scores an average of 1.6% lower than with the "PNG JPG" method_.
- The average file size with _the "PNG JPG" method was only 200 bytes larger_.
- In my brief testing, the relationship between ImageMagick's `-quality` setting and WebP's `-define webp:lossless=true` setting is completely unpredictable:
```console
$ magick convert img/10568-103447.pdf.png /tmp/10568-103447.webp
$ magick convert img/10568-103447.pdf.png -define webp:lossless=true /tmp/10568-103447-lossless.webp
$ magick convert img/10568-103447.pdf.png -define webp:lossless=true -quality 50 /tmp/10568-103447-lossless-q50.webp
$ magick convert img/10568-103447.pdf.png -quality 10 -define webp:lossless=true /tmp/10568-103447-lossless-q10.webp
$ magick convert img/10568-103447.pdf.png -quality 90 -define webp:lossless=true /tmp/10568-103447-lossless-q90.webp
$ ls -l /tmp/10568-103447*
-rw-r--r-- 1 aorth aorth 359258 Mar 28 21:16 /tmp/10568-103447-lossless-q10.webp
-rw-r--r-- 1 aorth aorth 303850 Mar 28 21:15 /tmp/10568-103447-lossless-q50.webp
-rw-r--r-- 1 aorth aorth 296832 Mar 28 21:16 /tmp/10568-103447-lossless-q90.webp
-rw-r--r-- 1 aorth aorth 299566 Mar 28 21:13 /tmp/10568-103447-lossless.webp
-rw-r--r-- 1 aorth aorth 190718 Mar 28 21:13 /tmp/10568-103447.webp
```
- I'm curious to see how the ImageMagick `-define webp:emulate-jpeg-size=true` option (aka `-jpeg_like` in cwebp) compares to normal lossy WebP quality:
```console
$ for q in 70 80 90; do magick convert img/10568-103447.pdf.png -quality $q -define webp:emulate-jpeg-size=true /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp; done
$ for q in 70 80 90; do magick convert /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp.png; done
$ for q in 70 80 90; do ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp.png 2>/dev/null; done
81.29082887
84.42134524
85.84458964
$ for q in 70 80 90; do magick convert img/10568-103447.pdf.png -quality $q /tmp/10568-103447-lossy-q${q}.webp; done
$ for q in 70 80 90; do magick convert /tmp/10568-103447-lossy-q${q}.webp /tmp/10568-103447-lossy-q${q}.webp.png; done
$ for q in 70 80 90; do ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-q${q}.webp.png 2>/dev/null; done
77.25789006
80.79140936
84.79108246
```
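- Putting the two sets of ssimulacra2 scores side by side, as plain arithmetic on the output above. The variable and function names are made up for this sketch, and since the file sizes were not recorded here the higher scores may simply reflect larger files:

```python
# ssimulacra2 scores keyed by ImageMagick -quality value, copied from above
emulate_jpeg_size = {70: 81.29082887, 80: 84.42134524, 90: 85.84458964}
normal_lossy = {70: 77.25789006, 80: 80.79140936, 90: 84.79108246}


def score_gains(candidate, baseline):
    """Difference in score (candidate minus baseline) for each quality."""
    return {q: round(candidate[q] - baseline[q], 2) for q in sorted(baseline)}


gains = score_gains(emulate_jpeg_size, normal_lossy)
for quality, gain in gains.items():
    print(f"Q{quality}: {gain:+.2f}")
```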
- Using `-define webp:method=6` (versus default 4) gets a ~0.5% increase on ssimulacra2 score
## 2023-03-29
- Looking at the `-define webp:near-lossless=$q` option in ImageMagick and I don't think it's working:
```console
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -define webp:near-lossless=$q -verbose /tmp/10568-103447-near-lossless-q${q}.webp; done
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q20.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q40.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q60.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.090u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q80.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.090u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q90.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
```
- The file sizes are all the same...
- If I try with `-quality $q` it works:
```console
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -quality $q -verbose /tmp/10568-103447-q${q}.webp; done
data/10568-103447.pdf[0]=>/tmp/10568-103447-q20.webp PDF 595x842 595x842+0+0 16-bit sRGB 52602B 0.080u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q40.webp PDF 595x842 595x842+0+0 16-bit sRGB 64604B 0.090u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q60.webp PDF 595x842 595x842+0+0 16-bit sRGB 73584B 0.080u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q80.webp PDF 595x842 595x842+0+0 16-bit sRGB 88652B 0.090u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q90.webp PDF 595x842 595x842+0+0 16-bit sRGB 113186B 0.100u 0:00.049
```
- I don't see any issues mentioning this in the ImageMagick GitHub issues, so I guess I have to file a bug
- I first [asked a question on their discussion board](https://github.com/ImageMagick/ImageMagick/discussions/6204) because I see that the near-lossless option should have been added to ImageMagick sometime after 2020 according to another discussion
- Meeting with Maria about the Alliance metadata on CGSpace
- As the Alliance is not a legal entity they want to reflect that somehow in CGSpace
- We discussed updating all metadata, but so many documents issued in the last few years have the Alliance indicated inside them, and as affiliations in journal article acknowledgements, etc, that we decided it is not the best option
- Instead, we propose to:
- Remove `Alliance of Bioversity International and CIAT` from the controlled vocabulary for affiliations ASAP
- Add `Bioversity International and the International Center for Tropical Agriculture` to the controlled vocabulary for affiliations ASAP
- Add a prominent note to the item page for every item in the Alliance community via a custom XMLUI theme (Maria and the Alliance publishing team to send the text)
## 2023-03-30
- The ImageMagick developers confirmed [my bug report](https://github.com/ImageMagick/ImageMagick/discussions/6204) and created a patch on master
- I'm not entirely sure how it works, but the developer seemed to imply we can use lossless mode plus a quality?
```console
$ magick convert -flatten data/10568-103447.pdf\[0\] -define webp:lossless=true -quality 90 /tmp/10568-103447.pdf.webp
```
- Now I see a difference between near-lossless and normal quality mode:
```console
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -define webp:lossless=true -quality $q /tmp/10568-103447-near-lossless-q${q}.webp; done
$ ls -l /tmp/10568-103447-near-lossless-q*
-rw-r--r-- 1 aorth aorth 108186 Mar 30 11:36 /tmp/10568-103447-near-lossless-q20.webp
-rw-r--r-- 1 aorth aorth 97170 Mar 30 11:36 /tmp/10568-103447-near-lossless-q40.webp
-rw-r--r-- 1 aorth aorth 97382 Mar 30 11:36 /tmp/10568-103447-near-lossless-q60.webp
-rw-r--r-- 1 aorth aorth 106090 Mar 30 11:36 /tmp/10568-103447-near-lossless-q80.webp
-rw-r--r-- 1 aorth aorth 105926 Mar 30 11:36 /tmp/10568-103447-near-lossless-q90.webp
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -quality $q /tmp/10568-103447-q${q}.webp; done
$ ls -l /tmp/10568-103447-q*
-rw-r--r-- 1 aorth aorth 52602 Mar 30 11:37 /tmp/10568-103447-q20.webp
-rw-r--r-- 1 aorth aorth 64604 Mar 30 11:37 /tmp/10568-103447-q40.webp
-rw-r--r-- 1 aorth aorth 73584 Mar 30 11:37 /tmp/10568-103447-q60.webp
-rw-r--r-- 1 aorth aorth 88652 Mar 30 11:37 /tmp/10568-103447-q80.webp
-rw-r--r-- 1 aorth aorth 113186 Mar 30 11:37 /tmp/10568-103447-q90.webp
```
- But after reading the source code in `coders/webp.c` I am not sure I understand, so I asked for clarification in the discussion
- Both Bosede and Abenet said mapping on CGSpace is taking a long time and I don't see any stuck locks so I decided to quickly restart PostgreSQL
## 2023-03-31
- Meeting with Daniel and Naim from Alliance in Cali about CGSpace metadata, TIP, etc
<!-- vim: set sw=2 ts=2: -->

545
content/posts/2023-04.md Normal file

@@ -0,0 +1,545 @@
---
title: "April, 2023"
date: 2023-04-02T08:19:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-04-02
- Run all system updates on CGSpace and reboot it
- I exported CGSpace to CSV to check for any missing Initiative collection mappings
- I also did a check for missing country/region mappings with csv-metadata-quality
- Start a harvest on AReS
<!--more-->
- I'm starting to get annoyed at my shell script for doing ImageMagick tests and looking to re-write it in something object oriented like Python
- There doesn't seem to be an official ImageMagick Python binding on pypi.org, perhaps I can use [Wand](https://docs.wand-py.org)?
- Testing Wand in Python:
```python
from wand.image import Image
with Image(filename='data/10568-103447.pdf[0]', resolution=144) as first_page:
    print(first_page.height)
```
- I spent more time re-working my thumbnail scripts to compare the resized images and other minor changes
- I am realizing that doing the thumbnails directly from the source improves the ssimulacra2 score by 1 to 3 percentage points compared to DSpace's method of creating a lossy supersample followed by a lossy resized thumbnail
## 2023-04-03
- The harvest on AReS that I started yesterday never finished, and actually seems to have died...
- Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems
- I stopped the harvest and started the plugins to get the remaining items via the sitemap...
## 2023-04-04
- Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja's communications and development team at UNEP
- I uploaded the presentation to CGSpace here: https://hdl.handle.net/10568/129896
- Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles
```console
$ csvcut -c Handle ~/Downloads/2023-04-04-Donald.csv \
| sed \
-e 1d \
-e 's_https://hdl.handle.net/__' \
-e 's_https://cgspace.cgiar.org/handle/__' \
-e 's_http://hdl.handle.net/__' \
| sort -u > /tmp/handles.txt
```
- Then I used the `get_dspace_pdfs.py` script to download them
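- The `sed` pipeline above can also be sketched in Python, which might be easier to extend if more URL variants turn up. This is only a sketch: the function name `normalize_handles` and the sample values are hypothetical, and unlike `sed` it only strips the three URLs when they appear at the start of a value:

```python
# Hypothetical helper mirroring the csvcut/sed/sort pipeline above.
PREFIXES = (
    "https://hdl.handle.net/",
    "https://cgspace.cgiar.org/handle/",
    "http://hdl.handle.net/",
)


def normalize_handles(values):
    """Strip known URL prefixes and return sorted, unique handles."""
    handles = set()
    for value in values:
        value = value.strip()
        for prefix in PREFIXES:
            if value.startswith(prefix):
                value = value[len(prefix):]
                break
        if value:  # skip empty cells
            handles.add(value)
    return sorted(handles)


if __name__ == "__main__":
    # Sample values for illustration only.
    sample = [
        "https://hdl.handle.net/10568/129896",
        "http://hdl.handle.net/10568/129896",
        "https://cgspace.cgiar.org/handle/10568/75611",
    ]
    print("\n".join(normalize_handles(sample)))
```

- Using a set gives the same de-duplication as `sort -u`, though Python sorts by code point rather than by locale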
## 2023-04-05
- After some cleanup on Donald's DOIs I started the `get_scihub_pdfs.py` script
## 2023-04-06
- I did some more work to clean up and streamline my next generation of DSpace thumbnail testing scripts
- I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use image operations like `-density` or `-define` before reading the input file
- I started [a discussion on the ImageMagick GitHub](https://github.com/ImageMagick/ImageMagick/discussions/6234) to ask
- Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
- As a measure of caution, I extracted the list of DOIs and used my `crossref_doi_lookup.py` script to get their licenses from Crossref:
```console
$ ./ilri/crossref_doi_lookup.py -e xxxx@i.org -i /tmp/dois.txt -o /tmp/donald-crossref-dois.csv -d
```
- Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were "No Derivatives", and re-formatting the DOIs:
```console
$ csvcut -c doi,license /tmp/donald-crossref-dois.csv \
| csvgrep -c license -m 'creativecommons' \
| csvgrep -c license -i -r 'by-(nd|nc-nd)' \
| sed -e 's_^10_https://doi.org/10_' \
-e 's/\(am\|tdm\|unspecified\|vor\): //' \
| tee /tmp/donald-open-dois.csv \
| wc -l
4268
```
- From those I filtered for the DOIs for which I had downloaded PDFs (the ones with a non-empty `filename` column in the output of the Sci-Hub script) and copied them to a separate directory:
```console
$ for file in $(csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv | csvgrep -c filename -i -r '^$' | csvcut -c filename | sed 1d); do cp --reflink=always "$file" "creative-commons-licensed/$file"; done
```
- I used BTRFS copy-on-write via reflinks to make sure I didn't duplicate the files :-D
- I ran out of time and had to stop the process around 3,127 PDFs
- I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses
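- The Creative Commons filtering step above can be sketched in Python with only the standard library. This is a sketch under assumptions: the function name `open_licensed` is hypothetical, the rows are expected to have the `doi` and `license` columns used in the `csvcut` command, and the version prefix is only stripped at the start of the license value:

```python
import re

# Licenses that forbid derivatives, same pattern as the csvgrep regex above.
NO_DERIVATIVES = re.compile(r"by-(nd|nc-nd)")
# Content-version prefixes removed by the sed expression above.
VERSION_PREFIX = re.compile(r"^(am|tdm|unspecified|vor): ")


def open_licensed(rows):
    """Yield (doi_url, license) for Creative Commons rows that allow derivatives."""
    for row in rows:
        license_value = row["license"]
        if "creativecommons" not in license_value:
            continue  # not a Creative Commons license
        if NO_DERIVATIVES.search(license_value):
            continue  # "No Derivatives" variants are excluded
        doi = row["doi"]
        if doi.startswith("10"):
            doi = f"https://doi.org/{doi}"
        yield doi, VERSION_PREFIX.sub("", license_value)


if __name__ == "__main__":
    # Invented rows for illustration, not real Crossref results.
    rows = [
        {"doi": "10.1234/a", "license": "vor: https://creativecommons.org/licenses/by/4.0/"},
        {"doi": "10.1234/b", "license": "vor: https://creativecommons.org/licenses/by-nc-nd/4.0/"},
    ]
    for doi, license_value in open_licensed(rows):
        print(doi, license_value)
```

- The rows could come from `csv.DictReader` reading the Crossref CSV, so the function does not care where the data originates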
## 2023-04-17
- Abenet noticed a weird issue with [this item](https://cgspace.cgiar.org/handle/10568/75611)
- The item has metadata, but the page is blank
- When I try to edit the item's authorization policies in XMLUI I get a `NullPointerException`:
```
Java stacktrace: java.lang.NullPointerException
at org.dspace.app.xmlui.aspect.administrative.authorization.EditItemPolicies.addBody(EditItemPolicies.java:166)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:234)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy201.startElement(Unknown Source)
at org.apache.cocoon.components.sax.XMLTeePipe.startElement(XMLTeePipe.java:87)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:251)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy203.startElement(Unknown Source)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:251)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy203.startElement(Unknown Source)
at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
at org.apache.cocoon.components.sax.XMLTeePipe.startElement(XMLTeePipe.java:87)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:251)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy203.startElement(Unknown Source)
at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
at org.apache.cocoon.components.sax.XMLTeePipe.startElement(XMLTeePipe.java:87)
at org.apache.cocoon.components.sax.AbstractXMLByteStreamInterpreter.parse(AbstractXMLByteStreamInterpreter.java:117)
at org.apache.cocoon.components.sax.XMLByteStreamInterpreter.deserialize(XMLByteStreamInterpreter.java:44)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:324)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:326)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:326)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:439)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.treeprocessor.sitemap.SerializeNode.invoke(SerializeNode.java:147)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.servlet.RequestProcessor.process(RequestProcessor.java:351)
at org.apache.cocoon.servlet.RequestProcessor.service(RequestProcessor.java:169)
at org.apache.cocoon.sitemap.SitemapServlet.service(SitemapServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:468)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy186.service(Unknown Source)
at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1216)
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1001)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:945)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:647)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:113)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:160)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:492)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:165)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:235)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:451)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1201)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:654)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:750)
```
- I don't see anything on the DSpace issue tracker or mailing list so I asked about it on the DSpace Slack...
- Peter said CGSpace was slow and I see a lot of locks from the XMLUI
- I looked and found many locks that were hours or even days old, so I killed the ones that were days old:
```console
$ psql < locks-age.sql | grep -E "[[:digit:]] days" | awk -F\| '{print $10}' | sort -u
1050672
1053773
1054602
1054702
1056782
1057629
1057630
$ psql < locks-age.sql | grep -E "[[:digit:]] days" | awk -F\| '{print $10}' | sort -u | xargs kill
```
- I'm also running a `dspace cleanup -v`, but it doesn't seem to be finishing
- I recall that in DSpace 6 errors like this show up in the logs rather than on the command line...
- I found it in the DSpace log:
```console
2023-04-17 21:09:46,004 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (uuid)=(a7ddf477-1c04-4de0-9c7a-4d3c84a875bc) is still referenced from table "bundle".
```
- If I mark the primary bitstream as null manually the cleanup script continues until it finds a few more
- I ended up with a long list of UUIDs to fix before the script would complete:
```console
$ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_bitstream_id in ('a7ddf477-1c04-4de0-9c7a-4d3c84a875bc', '9582b661-9c2d-4c86-be22-c3b0942b646a', '210a4d5d-3af9-46f0-84cc-682dd1431762', '51115f07-0a60-4988-8536-b9ebd2a5e15e', '0fc5021d-3264-413a-b2e2-74bda38a394e', '4704fa62-b8ab-4dfe-b7aa-0e4905f8412a')"
```
- This process ended up taking a few days because each iteration ran for over four hours before failing on the next UUID, sighhhhh
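- Collecting the UUIDs from the log and building the `UPDATE` statement could be scripted. The following is a sketch only: the function names are hypothetical, and the regular expression assumes the error detail always looks like the log line shown above:

```python
import re
import uuid

# Matches the "Detail" line logged for the foreign key violation shown above.
KEY_PATTERN = re.compile(
    r'Key \(uuid\)=\(([0-9a-f-]{36})\) is still referenced from table "bundle"'
)


def referenced_bitstreams(log_text):
    """Return the unique bitstream UUIDs named in the errors, in log order."""
    seen = []
    for match in KEY_PATTERN.finditer(log_text):
        # uuid.UUID raises ValueError if the captured text is malformed
        value = str(uuid.UUID(match.group(1)))
        if value not in seen:
            seen.append(value)
    return seen


def build_update(uuids):
    """Build the UPDATE statement used above from validated UUIDs."""
    if not uuids:
        raise ValueError("no UUIDs given")
    quoted = ", ".join(f"'{uuid.UUID(u)}'" for u in uuids)
    return (
        "UPDATE bundle SET primary_bitstream_id=NULL "
        f"WHERE primary_bitstream_id IN ({quoted})"
    )


if __name__ == "__main__":
    line = (
        "Detail: Key (uuid)=(a7ddf477-1c04-4de0-9c7a-4d3c84a875bc) "
        'is still referenced from table "bundle".'
    )
    print(build_update(referenced_bitstreams(line)))
```

- This would not remove the need to re-run the cleanup, since each run only logs the next failing UUID, but it avoids copying UUIDs by hand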
## 2023-04-18
- Regarding the item Abenet noticed yesterday that has a blank page and a `NullPointerException`:
- It appears OK on DSpace Test! https://dspacetest.cgiar.org/handle/10568/75611
- And according to the REST API on CGSpace the item was modified on 2023-04-11, so last week...
- According to the DSpace logs it was Francesca who edited the item last week, so I asked her for more information before I troubleshoot more
## 2023-04-19
- I fixed the Bioversity item by deleting the `9781138781276.jpg` bitstream via the REST API
- I *think* Francesca might have changed the "format" of it?
- Anyway, this item has a PDF so we have a proper thumbnail and don't need that other journal cover one
- I noticed a URL for this [Bioversity item](https://hdl.handle.net/10568/89049) redirects incorrectly
- I had mentioned this to Maria and Francesca a few months ago but it seems to never have been resolved
- The `dspace cleanup -v` finally finished after a few days of running and stopping...
- I decided to update the thumbnails in the Bioversity books collection because I saw a few old ones suffering from the CropBox issue
- Also, all day there's been a high load on CGSpace, with lots of locks in PostgreSQL
- I had been waiting until the bitstream cleanup finished... now I might need to restart PostgreSQL to kill some old locks as something needs to give
- I restarted PostgreSQL, but DSpace was still hanging on simple XMLUI options so I ended up restarting Tomcat
- Tag 544 ORCID identifiers with my script
- I updated my `generation-loss.sh` and `improved-dspace-thumbnails` scripts to include thirty-five PDFs from CGSpace (up from twenty-four) to get a larger sample
- Now starting to get some numbers comparing JPEG, WebP, and AVIF
- First, out of curiosity, I checked the average ssimulacra2 scores at Q75, Q80, and Q92 for each format:
| | Q75 | Q80 | Q92 |
|------|-----|-----|-----|
| JPEG | 71 | 74 | 88 |
| WebP | 74 | 77 | 82 |
| AVIF | 82 | 83 | 86 |
- Then I checked the quality and file size (bytes) needed to hit an average ssimulacra2 score of 80 with each format:
- **JPEG**: Q89, 124923 bytes
- **WebP**: Q86, 84662 bytes (32% smaller than JPEG size)
- **AVIF**: Q65, 67597 bytes (46% smaller than JPEG size)
- [Google's original WebP study](https://developers.google.com/speed/webp/docs/webp_study) uses this technique to compare WebP to JPEG too
- As the quality settings are not comparable between formats, we need to compare the formats at matching perceptual scores (ssimulacra2 in this case)
- I used a ssimulacra2 score of 80 because that's about the highest score I see with WebP using my samples, though JPEG and AVIF do go higher
- Also, according to current ssimulacra2 (v2.1), a score of 70 is "high quality" and a score of 90 is "very high quality", so 80 should be reasonably high enough...
- Here is a plot of the qualities and ssimulacra2 scores:
![Quality vs Score](/cgspace-notes/2023/04/quality-vs-score-ssimulacra-v2.1.png)
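- The "smaller than JPEG" figures follow from the byte counts listed above; a small sketch of the arithmetic, where the helper name `percent_smaller` is only illustrative:

```python
def percent_smaller(baseline_bytes, candidate_bytes):
    """Size reduction of candidate relative to baseline, in percent."""
    return (baseline_bytes - candidate_bytes) / baseline_bytes * 100


# Average file sizes needed to reach a ssimulacra2 score of 80 (from the notes)
sizes = {"JPEG": 124923, "WebP": 84662, "AVIF": 67597}

savings = {
    fmt: round(percent_smaller(sizes["JPEG"], size))
    for fmt, size in sizes.items()
    if fmt != "JPEG"
}

for fmt, pct in savings.items():
    print(f"{fmt}: {pct}% smaller than JPEG")
```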
- Export CGSpace to check for missing Initiatives mappings
## 2023-04-22
- Export the Initiatives collection to run it through csv-metadata-quality
- I wanted to make sure all the Initiatives items had correct regions
- I had to manually fix a few license identifiers and ISSNs
- Also, I found a few items submitted by MEL that had dates in DD/MM/YYYY format, so I sent them to Salem for him to investigate
- Start a harvest on AReS
## 2023-04-26
- Begin working on the list of non-AGROVOC CGSpace subjects for FAO
- The last time I did this was in 2022-06
- I used the following SQL query to dump values from all subject fields, lower case them, and group by counts:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2023-04-26-cgspace-subjects.csv WITH CSV HEADER;
COPY 26315
Time: 2761.981 ms (00:02.762)
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2023-04-26-cgspace-subjects.csv | sed '1d' > /tmp/2023-04-26-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-04-26-cgspace-subjects.txt -o /tmp/2023-04-26-cgspace-subjects-results.csv
```
## 2023-04-27
- The AGROVOC lookup from yesterday finished, so I extracted all terms that did not match and joined them with the original CSV so I can see the counts:
- (I also note that the `agrovoc_lookup.py` script didn't seem to be caching properly, as it had to look up everything again the next time I ran it despite the requests cache being 174MB!)
```console
csvgrep -c 'number of matches' -r '^0$' /tmp/2023-04-26-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2023-04-26-cgspace-subjects.csv - \
> /tmp/2023-04-26-cgspace-non-agrovoc.csv
```
- I filtered for only those terms that had counts larger than fifty
- I also removed terms like "forages", "policy", "pests and diseases" because those exist as singular or separate terms in AGROVOC
- I also removed ambiguous terms like "cocoa", "diversity", "resistance" etc because there are various other preferred terms for those in AGROVOC
- I also removed spelling mistakes like "modeling" and "savanas" because those exist in their correct form in AGROVOC
- I also removed internal CGIAR terms like "tac", "crp", "internal review" etc (note: these are mostly from CGIAR System Office's subjects... perhaps I exclude those next time?)
- I note that many of *our* terms would match if they were singular, plural, or split up into separate terms, so perhaps we should pair this with an exercise to review our own terms
- I couldn't finish the work locally yet so I uploaded my list to Google Docs to continue later
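- The count filter and the manual exclusions described above could be sketched in Python roughly as follows. This assumes the joined CSV has `subject` and `count` columns (as produced by the `\COPY` query); `filter_candidates` and the exclusion list passed to it are hypothetical:

```python
import csv
import io


def filter_candidates(csv_text, min_count=50, excluded=()):
    """Keep subjects used more than min_count times that are not excluded."""
    excluded = {term.lower() for term in excluded}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["subject"], int(row["count"]))
        for row in reader
        if int(row["count"]) > min_count
        and row["subject"].lower() not in excluded
    ]
```

- Keeping the exclusions in a plain list makes it easy to record *why* each term was dropped (plural form, ambiguous, misspelled, internal) the next time this is repeated.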
## 2023-04-28
- The ImageMagick CMYK issue is bothering me still
- I am on a plane currently, but I have a Docker image of ImageMagick 7.1.1-3 and I compared the output of all CMYK PDFs using the same command on my local machine
- The images from the Docker environment are correct with *only* `-colorspace sRGB` (no profiles!) as the commenters on GitHub said
- This leads me to believe something is wrong in my own environment, perhaps Ghostscript...?
- The container has Ghostscript 9.53.3~dfsg-7+deb11u2 from Debian 11, while my Arch Linux system has Ghostscript 10.01.1-1
<!-- vim: set sw=2 ts=2: -->

197
content/posts/2023-05.md Normal file

@@ -0,0 +1,197 @@
---
title: "May, 2023"
date: 2023-05-03T08:53:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-05-03
- Alliance's TIP team emailed me to ask about issues authenticating on CGSpace
- It seems their password expired, which is annoying
- I continued looking at the CGSpace subjects for the FAO / AGROVOC exercise that I started last week
- There are many of our subjects that would match if they added a "-" like "high yielding varieties" or used singular...
- Also I found at least two spelling mistakes, for example "decison support systems", which would match if it was spelled correctly
- Work on cleaning, proofing, and uploading twenty-seven records for IFPRI to CGSpace
<!--more-->
- I notice there are a few dozen locks from the `dspaceWeb` pool that are five days old on CGSpace so I killed them
```console
$ psql < locks-age.sql | grep " days " | awk -F"|" '{print $10}' | sort -u | xargs kill
```
## 2023-05-04
- Sync DSpace Test with CGSpace
- I replaced one item's thumbnail with a WebP version and XMLUI displays it fine
- I spent some time checking the CMYK issue with Arch's ImageMagick 7 and the Docker container and I think ImageMagick 7 just handles CMYK wrong...
- libvips does it correctly automatically and looks closer to the PDF
- Meeting about CG Core types
## 2023-05-10
- Write a script to find the `metadata_field_id` values associated with the non-AGROVOC subjects I am working on for Sara
- This is useful because we want to know who to contact for a definition
- The script was:
```bash
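# Look up which metadata fields use each subject (one subject per input line)
while read -r subject; do
    # -qtAX: quiet, tuples only, unaligned, skip psqlrc
    # Note: a subject containing a single quote would break this naive interpolation
    metadata_field_id=$(psql -h localhost -U postgres -d dspacetest -qtAX <<SQL
SELECT DISTINCT(metadata_field_id) FROM metadatavalue WHERE LOWER(text_value)='$subject'
SQL
)
    # Unquoted on purpose: word splitting turns newlines into spaces, then join with "||"
    metadata_field_id=$(echo $metadata_field_id | sed 's/[[:space:]]/||/g')
    echo "$subject,$metadata_field_id"
done < <(csvcut -c 1 ~/Downloads/2023-04-26\ CGIAR\ non-AGROVOC\ subjects.csv | sed 1d)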
```
- I also realized that Bernard Bett didn't have any items on CGSpace tagged with his ORCID identifier, so I tagged 230!
## 2023-05-11
- CG Core meeting
- Finalize looking at the CGSpace non-AGROVOC subjects for FAO
## 2023-05-12
- Export the Alliance community to do some country/region fixes
- I also sent Maria and Francesca the export because they want to add more regions and subregions
- Export the entire CGSpace to check for missing Initiative collection mappings
- I also added missing regions
## 2023-05-16
- I finally cleaned up and published my latest evaluation of [JPEG, WebP, and AVIF](https://alanorth.github.io/improved-dspace-thumbnails/evaluating-jpeg-webp-avif.html)
- I [filed an issue on DSpace](https://github.com/DSpace/DSpace/issues/8849) to track this
## 2023-05-17
- Re-sync CGSpace to DSpace 7 Test
- I came up with a naive patch to use WebP instead of JPEG in the DSpace ImageMagick filter, and it works, but doesn't replace existing JPEGs... hmmm
- Also, it does PDF to WebP to WebP haha
## 2023-05-18
- I created a [pull request](https://github.com/DSpace/DSpace/pull/8850) to improve some minor documentation, typo, and logic issues in the DSpace ImageMagick thumbnail filters
- I realized that there is a quick win to the generation loss issue with ImageMagickThumbnailFilter
- We can use ImageMagick's internal MIFF instead of JPEG when writing the intermediate image
- According to the [libvips author PNG is very slow](https://github.com/libvips/libvips/issues/571)!
- I re-ran my `generation-loss.sh` script using MIFF and found that it had essentially the same results as PNG, which is about 1.1 points higher on the ssimulacra2 (v2.1) scoring scale
- Also, according to my tests with the cosmo rusage.com utility, I see that MIFF is indeed much faster than PNG
- I updated my pull request to add this quick win
- Weekly CG Core types meeting
- Low attendance so I just kept working on the spreadsheet
- We are at the stage of voting on definitions
## 2023-05-19
- I ported a few of the minor ImageMagick Thumbnail Filter improvements to our `6_x-prod` branch
## 2023-05-20
- I deployed the latest thumbnail changes on CGSpace, ran all updates, and rebooted it
- I exported CGSpace to check for missing Initiative mappings
- Then I started a harvest on AReS
## 2023-05-23
- Help Francesca with an import of a journal article with a few hundred authors
- I used the DSpace 7 live import from PubMed
- I also noticed a bug in the CrossRef live import if you change the DOI field, so I [filed an issue](https://github.com/DSpace/DSpace/issues/8865)
## 2023-05-25
- Meeting on output types
- Make a [pull request on DSpace to capture publisher during live import from Crossref](https://github.com/DSpace/DSpace/pull/8866)
## 2023-05-26
- Make a [pull request on DSpace to update checkstyle](https://github.com/DSpace/DSpace/pull/8868)
- Make a [pull request on DSpace-angular to fix an incorrect i18n UI string](https://github.com/DSpace/dspace-angular/pull/2274)
- I'm experimenting with replacing old thumbnails
- In the past we used to upload thumbnails for journal covers, but those were low quality and look horrible now
- Using the provenance field I want to identify items with 1 bitstream of type gif or jpg, then extract the item IDs along with DOIs:
```sql
\COPY (SELECT
text_value,
dspace_object_id
FROM
metadatavalue
WHERE
dspace_object_id IN (
SELECT
dspace_object_id
FROM
metadatavalue
WHERE
metadata_field_id = 28
AND place = 0
AND (text_value LIKE '%No. of bitstreams: 1%'
AND text_value SIMILAR TO '%.(gif|jpg|jpeg)%'))
AND metadata_field_id = 220) TO /tmp/items-with-old-bitstreams.csv WITH CSV HEADER;
```
- I extract the DOIs and look them up on CrossRef to see which are CC-BY, then extract those:
```console
$ csvcut -c text_value /tmp/items-with-old-bitstreams.csv | sed 1d > /tmp/dois.txt
$ ./ilri/crossref_doi_lookup.py -i /tmp/dois.txt -e fuuu@example.com -o /tmp/dois-resolved.csv
$ csvgrep -c license -m 'creativecommons' /tmp/dois-resolved.csv \
| csvgrep -c license -m 'by-nc-nd' --invert-match \
| csvcut -c doi \
| sed '2,$s_^\(.*\)$_https://doi.org/\1_' \
| sed 1d > /tmp/dois-for-cc-items-with-old-bitstreams.txt
```
- This results in 262 items that have DOIs that are CC-BY (but not ND)
- This is a good starting point, but misses some that had low-quality thumbnails uploaded after they were added (ie, there's no record of a bitstream in the provenance field)
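- A rough Python equivalent of the `csvgrep`/`csvcut`/`sed` license filter above, assuming the resolved CSV has `doi` and `license` columns as the pipeline implies (`cc_doi_urls` is a hypothetical name). Like the pipeline, it only excludes `by-nc-nd`, so a plain `by-nd` license would still pass:

```python
import csv
import io


def cc_doi_urls(csv_text):
    """Return https://doi.org/ URLs for Creative Commons DOIs, minus BY-NC-ND."""
    urls = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Lowercased for convenience; csvgrep -m matches case-sensitively
        license_url = row["license"].lower()
        if "creativecommons" in license_url and "by-nc-nd" not in license_url:
            urls.append(f"https://doi.org/{row['doi']}")
    return urls
```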
- I ran the list through my Sci-Hub download script and filtered out a few that downloaded invalid PDFs (manually), then generated thumbnails for all of them:
```console
$ ~/src/git/DSpace/ilri/get_scihub_pdfs.py -i /tmp/dois-for-cc-items-with-old-bitstreams.txt -o bitstreams.csv
$ chrt -b 0 vipsthumbnail *.pdf --export-profile srgb -s 600x600 -o './%s.pdf.jpg[Q=92,optimize_coding,strip]'
```
- Then I joined the CSVs on the DOI column, filtered out any that we didn't find PDFs for, and formatted the resulting CSV with an id, filename, and bundle column:
```console
$ csvjoin -c doi bitstreams.csv /tmp/items-with-old-bitstreams.csv \
| csvgrep -c filename --invert-match -r '^$' \
| sed '1s/dspace_object_id/id/' \
| csvcut -c id,filename \
| sed -e '1s/^\(.*\)$/\1,bundle/' -e '2,$s/^\(.*\)$/\1.jpg__description:libvips thumbnail,THUMBNAIL/' > new-thumbnails.csv
```
- I did a dry run with `ilri/post_bitstreams.py` and it seems that most (all?) already have thumbnails from the last time I did a massive Sci-Hub check
- So relying on the provenance field is not very reliable it seems, and that was a waste of two hours...
- I did discover, while originally posting WebP thumbnails, that the format is not set correctly when uploading WebP via the REST API (it is set to Unknown), though it does work when uploading via XMLUI
- POSTing a JPG to the THUMBNAIL bundle sets the format to JPEG...
- I am guessing that is a bug that I won't bother troubleshooting since the DSpace 6.x REST API is deprecated
## 2023-05-27
- Export CGSpace to check for missing Initiative collection mappings
- Then I also ran the csv-metadata-quality tool on the Initiatives to do some easy fixes like country/region mapping and whitespace fixes
- Start a harvest on AReS
## 2023-05-29
- Re-create my local PostgreSQL 14 container:
```console
$ podman rm dspacedb14
$ podman pull docker.io/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d docker.io/postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```
- Export CGSpace again to do some major cleanups in OpenRefine
- I found a few countries that are in the ISO 3166-1 and UN M.49 lists, but not in ours so I added them to the list in `input-forms.xml` and regenerated the controlled vocabularies for the CGSpace Submission Guidelines
- There were a handful of issues with ISSNs, ISBNs, DOIs, access status, licenses, and missing CGIAR Trust Fund donors for Initiatives outputs
- This was about 455 items
- Helping the Alliance web team understand the DSpace REST API for determining which collection an item belongs to
<!-- vim: set sw=2 ts=2: -->

252
content/posts/2023-06.md Normal file

@@ -0,0 +1,252 @@
---
title: "June, 2023"
date: 2023-06-02T10:29:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-06-02
- Spend some time testing my `post_bitstreams.py` script to update thumbnails for items on CGSpace
- Interestingly I found an item with a JFIF thumbnail and another with a WebP thumbnail...
- Meeting with Valentina, Stefano, and Sara about MODS metadata in CGSpace
- They have experience with improving the MODS interface in MELSpace's OAI-PMH for use with AGRIS and were curious if we could do the same in CGSpace
- From what I can see we need to upgrade the MODS schema from 3.1 to 3.7 and then just add a bunch of our fields to the crosswalk
<!--more-->
## 2023-06-04
- Upgrade CGSpace to Ubuntu 22.04
- The upgrade was mostly normal, but I had to unhold the openjdk package in order for `do-release-upgrade` to run:
```console
# apt-mark unhold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
```
- In [2022-11]({{< relref "2022-11.md" >}}) an upstream Java update broke the DSpace 6 Handle server so we will have to pin this again after the upgrade to Ubuntu 22.04
- After the upgrade I made sure CGSpace was working, then proceeded to upgrade PostgreSQL from 12 to 14, like I did on [DSpace Test in 2023-03]({{< relref "2023-03.md" >}})
- Then I had to downgrade OpenJDK to fix the Handle server using the ones I had previously downloaded for Ubuntu 20.04 because they no longer exist on Launchpad:
```console
# dpkg -i openjdk-8-j*8u342-b07*.deb
```
- Export CGSpace to fix missing Initiative collection mappings
- Start a harvest on AReS
- Work on the DSpace 7 migration a bit more
- I decided to rebase and drop all the submission form edits because they conflict every time upstream changes!
## 2023-06-06
- Fix some incorrect ORCID identifiers for an Alliance author on CGSpace
- Export our list of ORCID identifiers, resolve them, and update the records in CGSpace:
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-06-06-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-06-06-orcids.txt -o /tmp/2023-06-06-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2023-06-06-orcids-names.txt -db dspacetest -u dspace -p 'ffff' -m 247
```
- Start working on updating the MODS schema in CGSpace from 3.1 to 3.8 based on Stefano and Salem's work last year
## 2023-06-08
- Continue working on the MODS schema mapping
- Export CGSpace to check and update `dcterms.extent` fields
- I normalized about 1,500 to use either "p. 1-6" or "5 p." format
- Also, I used this GREL expression to extract missing pages from the citation field: `cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*(pp?\.\s?\d+[-]\d+).*/)[0]`
- This was over 4,000 items with a format like "p. 1-6" and "pp. 1-6" in the citation
- I used another GREL expression to extract another 5,000: `cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*?(\d+\s+?[Pp]+\.).*/)[0]`
- This was for the format like "1 p." (note we had to protect against the greedy `.*` in the beginning)
- I also did some work to capture a handful of missing DOIs and ISSNs, but it was only about 100 items and I will have to wait until the 10,000+ above finish importing
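- The two extent formats above can also be pulled out with Python's `re` module. This is a sketch rather than a port of the GREL expressions: it uses `search()` on the inner patterns instead of wrapping them in `.*`, which sidesteps the greedy-prefix problem mentioned above. The function name `extract_extent` is hypothetical:

```python
import re

# Page ranges such as "p. 1-6" or "pp. 1-6"
RANGE_RE = re.compile(r"pp?\.\s?\d+-\d+")
# Page counts such as "5 p."
COUNT_RE = re.compile(r"\d+\s+[Pp]+\.")


def extract_extent(citation):
    """Return the first page range or page count found in a citation, else None."""
    for pattern in (RANGE_RE, COUNT_RE):
        match = pattern.search(citation)
        if match:
            return match.group(0)
    return None
```

- Ranges are tried first so that a citation containing both forms yields the range.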
## 2023-06-09
- I see there are ~200 users in CGSpace that have registered with their CGIAR email address using a password as opposed to using Active Directory:
```sql
SELECT * FROM eperson WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL;
```
- I am wondering if I should delete their passwords and tell them to log in using LDAP
- As an initial test I will reset a few accounts including my own that have passwords and salts:
```sql
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE netid IN ('axxxx', 'axxxx', 'bxxxx');
```
- I also decided to reset passwords/salts for CGIAR accounts that have not been active since 2021 (1.5 years ago):
```sql
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL AND salt IS NOT NULL AND last_active < '2022-01-01'::date;
```
- This was about 100 accounts...
- I will wait some more time before I decide what to do about the more current ones
- Add a few more ORCID identifiers to my list and tag them on CGSpace
## 2023-06-10
- Export CGSpace to check for missing Initiative mappings
- Start a harvest on AReS
## 2023-06-11
- File [an issue](https://github.com/DSpace/DSpace/issues/8900) on DSpace for the `Content-Disposition` bug causing images to get downloaded instead of opened inline
## 2023-06-12
- Export CGSpace to do some more work extracting volume and issue from citations for items where they are missing
- I found and fixed over 7,000!
- Then I found and extracted another 7,000 items with no extents (pages)
- Then I replaced all occurrences of en dashes for ranges in pages with regular hyphens
## 2023-06-13
- Last night I finally figured out how to do basic overrides to the simple item view in Angular
- Add a handful of new ORCID identifiers to my list and tag them on CGSpace
- Extract a list of all the proposed actions for CG Core output types and create a [new issue for them on CG Core's GitHub repository](https://github.com/AgriculturalSemantics/cg-core/issues/45)
- Extract a list of all the proposed actions for CG Core output types for MARLO and create [a new issue for them on MARLO's GitHub repository](https://github.com/CCAFS/MARLO/issues/2479)
- Meeting with Indira, Ryan, and Abenet to discuss plans for the DSpace 7 focus group
## 2023-06-14
- Did some more work on the DSpace 7 Test to improve the submission forms and the look and feel
- Extract a list of all the proposed actions for CG Core output types for MEL and create [a new issue for them on MEL's GitHub repository](https://github.com/CodeObia/MEL/issues/11216)
- I filed [an issue about the yarn merge-i18n script](https://github.com/DSpace/dspace-angular/issues/2309)
- I made [a pull request for some Finnish language i18n strings](https://github.com/DSpace/dspace-angular/pull/2306)
- I made [a pull request to lint the i18n en.json5 file](https://github.com/DSpace/dspace-angular/pull/2306)
## 2023-06-15
- A lot more work on DSpace 7
- I tested some pull requests and worked on the style of the item view and homepage
## 2023-06-16
- A lot more work on DSpace 7
- I made [a pull request to adjust font weight in item counts](https://github.com/DSpace/dspace-angular/pull/2316)
- I made [a pull request to update the ESLint configuration for JSON5](https://github.com/DSpace/dspace-angular/pull/2317)
## 2023-06-17
- Export CGSpace to check for missing Initiative collection mappings
- I also spent some time doing sanity checks on countries, regions, DOIs, and more
- I lowercased all our AGROVOC keywords in `dcterms.subject`:
```sql
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 2392
dspace=*# COMMIT;
COMMIT
```
- Start a harvest on AReS
## 2023-06-19
- Today I started getting an error on DSpace 7 Test
- The page loads, and then when it is almost done it goes blank (white) with this in the console:
```console
ERROR DOMException: CSSStyleSheet.cssRules getter: Not allowed to access cross-origin stylesheet
```
- I restarted Angular, but it didn't fix it
- The `yarn test:rest` script shows everything OK, and I haven't changed anything recently...
- I re-compiled the Angular UI using the default theme and it was the same...
- I tried in Firefox Nightly and it works...
- So it must be something related to the browser
- I tried clearing all the session storage / cookies and refreshing and it worked
- I switched back to the CGSpace theme and it happened again
- I had a hunch it might be due to the GDPR cookie plugin in my browser, so I disabled that and then refreshed and it worked... hmmm
- Upload thumbnails for about 42 IITA Journal Articles after resolving their DOIs and making sure they were not CC ND
- I fixed a few bugs in `get_scihub_pdfs.py` in the process
## 2023-06-21
- Stefano got back to me about the MODS OAI-PMH schema test on DSpace Test
- He said that it's fine if we use iso8601 encoding for dates instead of w3cdtf and asked if we can create a custom end point for AGRIS that only includes types like Journal Articles similar to how Salem did it: https://melspace.loc.codeobia.com/oai/agris?verb=ListRecords&metadataPrefix=mods
- I updated DSpace Test with the new date format and said I'd work on the custom AGRIS set
## 2023-06-25
- Export CGSpace to check for missing Initiative collection mappings
- I wanted to start a harvest on AReS but I've seen the load on the server high for a few days and I'm not sure what it is
- I decided to run all updates and reboot it since it's Sunday anyway
## 2023-06-26
- Since the new DSpace 7 will respect newlines in metadata fields I am curious to see how many of our abstracts have poor newlines
- I exported CGSpace and used a custom text facet with this GREL expression in OpenRefine to count the number of newlines in each cell:
```console
value.split('\n').length()
```
- Also useful to check for general length of the text in the cell to make sure it's a reasonably long string
- I spent some time trying to find a pattern that I could use to identify "easy" targets, but there are so many exceptions that it will have to be done manually
- I fixed a few dozen
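- Outside OpenRefine, the same check could be sketched in Python. Note that `split('\n').length()` counts segments, which is the number of newlines plus one; the helper name `line_stats` is hypothetical:

```python
def line_stats(abstract):
    """Return segment count, newline count, and overall length for a text cell."""
    segments = abstract.split("\n")
    return {
        "segments": len(segments),      # what the GREL expression reports
        "newlines": len(segments) - 1,  # actual number of newline characters
        "length": len(abstract),        # helps spot short cells with many breaks
    }
```

- A short text with many segments is a likely candidate for hard-wrapped lines pasted from a PDF, though as noted above the exceptions still need manual review.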
- Do a bit of work on thumbnails on CGSpace
- I'm trying to troubleshoot the Discovery error I get on DSpace 7:
```console
java.lang.NullPointerException: Cannot invoke "org.dspace.discovery.configuration.DiscoverySearchFilterFacet.getIndexFieldName()" because the return value of "org.dspace.content.authority.DSpaceControlledVocabularyIndex.getFacetConfig()" is null
```
- I reverted to the default `submission-forms.xml` and the `getFacetConfig()` error goes away...
- Kill some long-held locks on CGSpace PostgreSQL, as some users are complaining of slowness in archiving
- I did some testing of the LDAP login issue related to groupmaps
- It does seem to be a regression from the [LDAP auth patch](https://github.com/DSpace/DSpace/pull/8814) from last month, so I [filed an issue](https://github.com/DSpace/DSpace/issues/8920)
- I spent some time working on Angular and I figured out how to add a custom Angular component to show the UN SDG Goal icons on DSpace 7
## 2023-06-27
- I debugged the NullPointerException and somehow it disappeared
- It seems to be related to the external controlled vocabularies in the submission form
- I removed them all, then added them all back, and now the issue is solved... hmmmm
- Oh no, now they are gone again, sigh...
## 2023-06-28
- Spent a lot of time debugging the browse indexes
- Looking at the [DSpace 7 demo API](https://api7.dspace.org/server/api/discover/browses) I see the four default browse indexes from `dspace.cfg` and the one default `srsc` one that gets automatically enabled from the `<vocabulary>srsc</vocabulary>` in the `submission-forms.xml`
- The same API call on my test DSpace 7 configuration results in the HTTP 500 I've been seeing for some time, and I am pretty sure it's due to the automagic configuration of hierarchical browses based on the submission form
- Yes, if I remove them all from my submission form then this works: http://localhost:8080/server/api/discover/browses
- I went through each of our vocabularies and tested them one by one:
- dcterms-subject: OK
- dc-contributor-author: NO
- cg-creator-identifier: NO
- cg-contributor-affiliation: OK (and with `facetType: "affiliation"` in API response?!)
- cg-contributor-donor: OK (`facetType: "sponsorship"`)
- cg-journal: NO
- cg-coverage-subregion: NO
- cg-species-breed: NO
- Now I need to figure out what it is about those five that causes them to not work!
- Ah, after debugging with someone on the DSpace Slack, I realized that DSpace expects these vocabularies to have corresponding indexes configured in `discovery.xml`, and they must be added as search filters AND sidebar facets.
## 2023-06-29
- I noticed there is now a [patched version of the Handle JAR for DSpace 6.x](https://github.com/DSpace/DSpace/issues/8557#issuecomment-1595340249)
- This fixes the [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1), so we can remove the apt pin on JDK now
- I deployed it on CGSpace and it's working!
- I lowercased all our AGROVOC terms because I noticed a few that were not:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 53
dspace=*# COMMIT;
```
- After more discussion about the NullPointerException related to browse options, I filed [an issue](https://github.com/DSpace/DSpace/issues/8927)
## 2023-06-30
- I added another custom component to display CGIAR Impact Area icons in the DSpace 7 test
<!-- vim: set sw=2 ts=2: -->

324
content/posts/2023-07.md Normal file

@@ -0,0 +1,324 @@
---
title: "July, 2023"
date: 2023-07-01T17:14:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-07-01
- Export CGSpace to check for missing Initiative collection mappings
- Start harvesting on AReS
## 2023-07-02
- Minor edits to the `crossref_doi_lookup.py` script while running some checks on 22,000 CGSpace DOIs
## 2023-07-03
- I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
- I took the more accurate ones from Crossref and updated the items on CGSpace
- I took a few hundred ISBNs as well for where we were missing them
- I also tagged ~4,700 items with missing licenses as "Copyrighted; all rights reserved" based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer
- Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it's usually copyrighted (could still be open access, but we can't tell via Crossref)
- I would be curious to write a script to check the Unpaywall API for open access status...
- In the past I found that their *license* status was not very accurate, but the open access status might be more reliable
- More minor work on the DSpace 7 item views
- I learned some new Angular template syntax
- I created a custom component to show Creative Commons licenses on the simple item page
- I also decided that I don't like the Impact Area icons as a component because they don't have any visual meaning
## 2023-07-04
- Focus group meeting with CGSpace partners about DSpace 7
- I added a themed file selection component to the CGSpace theme
- It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI
- I added a custom component to show share icons
## 2023-07-05
- I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13
- Most things work but there are some minor bugs it seems
- Mishell from CIP emailed me to say she was having problems approving an item on CGSpace
- Looking at PostgreSQL I saw there were a dozen or so locks that were several hours and even over one day old so I killed those processes and told her to try again
## 2023-07-06
- Types meeting
- I wrote a Python script to check Unpaywall for some information about DOIs
## 2023-07-07
- Continue exploring Unpaywall data for some of our DOIs
- In the past I've found their _licensing_ information to not be very reliable (preferring Crossref), but I think their _open access status_ is more reliable, especially when the provider is listed as being the publisher
- Even so, sometimes the version can be "acceptedVersion", which is presumably the author's version, as opposed to the "publishedVersion", which means it's available as open access on the publisher's website
- I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses
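- The distinction above between open access status, hosting provider, and version could be expressed as a small helper over an Unpaywall v2 record. This is a sketch, not the script mentioned on 2023-07-06: it assumes the record is the parsed JSON for one DOI with the `is_oa` and `best_oa_location` (`host_type`, `version`) fields, and the name `summarize_oa` is hypothetical:

```python
def summarize_oa(record):
    """Summarize open access details from one parsed Unpaywall v2 record."""
    # best_oa_location is null (None) when no open access copy is known
    location = record.get("best_oa_location") or {}
    is_oa = bool(record.get("is_oa"))
    host_type = location.get("host_type")
    version = location.get("version")
    return {
        "doi": record.get("doi"),
        "is_oa": is_oa,
        "host_type": host_type,
        "version": version,
        # Most trustworthy case: published version hosted by the publisher
        "publisher_published": (
            is_oa and host_type == "publisher" and version == "publishedVersion"
        ),
    }
```

- Records where `version` is `acceptedVersion` would still need a manual look before changing an item's access status.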
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Start working on some statistics on AGROVOC usage for my presentation next week
- I used the following SQL query to dump values from all subject fields and lower case them:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2023-07-07-cgspace-subjects.csv WITH CSV HEADER;
COPY 26443
Time: 2564.851 ms (00:02.565)
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2023-07-07-cgspace-subjects.csv | sed '1d' > /tmp/2023-07-07-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-07-07-cgspace-subjects.txt -o /tmp/2023-07-07-cgspace-subjects-results.csv
```
- I did some more tests with Angular 13 on OpenRXV and found out why the repository type dropdown wasn't working
- It was because of a missing 1-line JSON file in the data directory, which is runtime data, not code
- I copied the data directory from the production server and rebuilt, and the site is working well now
- I did a full harvest with plugins and it worked!
- So it seems Angular 13.4.0 will work, yay
## 2023-07-08
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- The AGROVOC lookup finished, so I checked the number of matches:
```console
$ csvgrep -c 'match type' -r '^.+$' ~/Downloads/2023-07-07-cgspace-subjects-resolved.csv | sed 1d | wc -l
12528
```
- So that's 12,528 out of 26,443 unique terms (about 47.4%)
- I did a LOT of work on the OpenRXV frontend build dependencies to bring them more in line with Angular 13
## 2023-07-10
- I did a lot more work on OpenRXV to test and update dependencies
- I deployed the latest version on the production server
## 2023-07-12
- CGSpace upgrade meeting with Americas and Africa group
## 2023-07-13
- Michael Victor asked me to help Aditi extract some information from CGSpace
- She was interested in journal articles published between 2018 and 2023 with a range of subjects related to drought, flooding, resilience, etc
- I used an advanced query with some AGROVOC terms:
```console
dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration" OR dcterms.subject:livestock)
```
- Interestingly, some variations of this same exact query produce no search results, and I see this error in the DSpace log:
```console
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:livestock OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration\"\)': Lexical error at line 1, column 617. Encountered: <EOF> after : "\"landscape restoration\\\"\\)"
```
- It seems to happen when there is a quoted search term at the end of the parenthesized group
- For what it's worth this same query worked fine on DSpace 7.6
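- One possible workaround, sketched under the assumption that the parser only trips when a quoted phrase is the last term in the group: build the group so that a single-word term always comes last. The helper name is hypothetical:

```python
def build_subject_group(terms: list[str], field: str = "dcterms.subject") -> str:
    """Build a parenthesized OR group that ends with an unquoted term."""
    phrases = [t for t in terms if " " in t]
    words = [t for t in terms if " " not in t]
    if not words:
        raise ValueError("need at least one single-word term to end the group")
    # Keep one single-word term aside so it can close the group
    *leading_words, last_word = words
    clauses = [f"{field}:{t}" for t in leading_words]
    # Multi-word phrases must be quoted
    clauses += [f'{field}:"{t}"' for t in phrases]
    clauses.append(f"{field}:{last_word}")
    return "(" + " OR ".join(clauses) + ")"


terms = ["flooding", "drought", "landscape restoration", "livestock"]
print(build_subject_group(terms))
```

- This only reorders the terms, so the set of matching items should be the same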
## 2023-07-15
- Export CGSpace to fix missing Initiative collection mappings
- Start a harvest on AReS
## 2023-07-17
- Rasika had sent me a list of new ORCID identifiers for new IWMI staff so I combined them with our existing list and ran `resolve_orcids.py` to refresh the names in our database
- I updated the list, updated names in the database, and tagged new authors with missing identifiers in existing items
## 2023-07-18
- Meeting with IWMI, IRRI, and IITA colleagues about CGSpace upgrade plans
- Maria from the Alliance mentioned having some submissions stuck on CGSpace
- I looked and found a number of locks stuck for nineteen, eighteen, and more hours...
- I killed them and told her to try again
```console
$ psql < locks-age.sql | less -S
$ psql < locks-age.sql | grep -E " (19|18|17|16|12):" | awk -F"|" '{print $10}' | sort -u | xargs kill
```
## 2023-07-19
- I had to kill a bunch more locked processes in PostgreSQL, I'm not sure what's going on
- After some discussion about an advanced search bug with Tim on Slack, I filed [an issue on GitHub](https://github.com/DSpace/DSpace/issues/8962)
## 2023-07-20
- I added a new metadata field for CGIAR Impact Platforms (`cg.subject.impactPlatform`) to CGSpace
## 2023-07-22
- Export CGSpace to fix missing Initiative collections
- Start a harvest on AReS
## 2023-07-24
- Test Salem's new JavaScript-based DSpace Statistics API and send him some feedback
- I noticed a few times that the Solr service on my DSpace 7 instance is getting OOM killed
- I had been using a 4g Solr heap, but maybe we don't need that much
- Tomcat is also using 4.6GB, and then there's PostgreSQL... so perhaps it's all a bit much on this system now
## 2023-07-25
- Start testing exporting DSpace 6 Solr cores to import on DSpace 7:
```console
$ chrt -b 0 dspace solr-export-statistics -i statistics
```
- I'm curious how long it takes and how much data there will be
- The size of the Solr data directory is currently 82GB
- The export took about 2.5 hours and created 6,000 individual CSVs, one for each day of Solr stats
- The size of the exported CSVs is about 88GB
- I will copy just a few years to import on the DSpace 7 test server
- So importing these is going to require removing the Atmire custom fields:
```console
$ dspace solr-import-statistics -i statistics
Exception: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)
```
- I will try using solr-import-export-json, which I've used in the past to skip Atmire custom fields in Solr:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2022.json -f 'time:[2022-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,geoIpCountryCode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId,core_update_run_nb
```
- Some users complained that CGSpace was slow and I found a handful of locks that were hours and days old...
- I killed those and told them to try again
- After importing the Solr statistics into DSpace 7 I realized that my DSpace Statistics API will work fine
- I made some minor modifications to the Ansible infrastructure scripts to make sure it is enabled and then activated it on DSpace 7 Test
## 2023-07-26
- Debugging lock issues on CGSpace
- I see the blocking PIDs for some long-held locks are "idle in transaction":
```console
$ ps auxw | grep -E "(1864132|1659487)"
postgres 1659487 0.0 0.5 3269900 197120 ? Ss Jul25 0:03 postgres: 14/main: cgspace cgspace 127.0.0.1(61648) idle in transaction
postgres 1864132 0.1 0.7 3275704 254528 ? Ss 07:27 0:08 postgres: 14/main: cgspace cgspace 127.0.0.1(36998) idle in transaction
postgres 1880388 0.0 0.0 9208 2432 pts/3 S+ 08:48 0:00 grep -E (1864132|1659487)
```
- I used some other scripts and found that those processes were executing the following statement:
```console
select nextval ('public.tasklistitem_seq')
```
- I don't know why these can get blocked for hours without resolution, but for now I just killed them
- For what it's worth [these sequences were removed in DSpace 7.0](https://github.com/DSpace/DSpace/commit/16ae96b4c3d833c2a4acd1f05985d424c3a52bd7) along with the "traditional" item workflow—maybe that means we won't have such contention issues in DSpace 7!
- I wrote a slightly longer regex to match locks that have been stuck for more than 1 hour based on the output of the `locks-age.sql` script and killed them:
```console
$ psql < locks-age.sql | awk -F"|" '/ [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
```
- I filed [an issue for missing Altmetric badges on DSpace 7 Angular](https://github.com/DSpace/dspace-angular/issues/2400)
## 2023-07-27
- Export CGSpace to check countries, regions, types, and Initiatives
- There were a few minor issues in countries and regions, and I noticed 186 items without types!
- Then I ran the file through csv-metadata-quality to make sure items with countries have appropriate regions
- Brief discussion about OpenRXV bugs and fixes with Moayad
- I was toying with the idea of using an expanded whitespace check/fix based on [ESLint's no-irregular-whitespace](https://eslint.org/docs/latest/rules/no-irregular-whitespace) rule in csv-metadata-quality
- I found 176 items in CGSpace with such whitespace in their titles alone
- I compared the results of removing these characters and replacing them with a space
- In _most_ cases removing it is the correct thing to do, for example "Pesticides : une arme à double tranchant" → "Pesticides: une arme à double tranchant"
- But in some items it is tricky, for example "L'environnement juridique est-il propice à la gestion" → "L'environnement juridique est-il propice àla gestion"
- I guess it would really need some good heuristics or a human to verify...
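- A rough sketch of that comparison, using a character set based on my reading of the ESLint rule's list and assuming a no-break space (U+00A0) in the positions shown in the two titles above. The function names are made up:

```python
import re

# Characters based on ESLint's no-irregular-whitespace rule
IRREGULAR_WHITESPACE = re.compile(
    "[\u000b\u000c\u0085\u00a0\u1680\u180e\u2000-\u200b\u2028\u2029\u202f\u205f\u3000\ufeff]"
)


def remove_irregular(text: str) -> str:
    """Drop irregular whitespace characters entirely."""
    return IRREGULAR_WHITESPACE.sub("", text)


def replace_irregular(text: str) -> str:
    """Replace irregular whitespace with a space, then collapse runs of spaces."""
    return re.sub(" {2,}", " ", IRREGULAR_WHITESPACE.sub(" ", text))


titles = [
    "Pesticides\u00a0: une arme à double tranchant",
    "L'environnement juridique est-il propice à\u00a0la gestion",
]
for title in titles:
    print(repr(remove_irregular(title)), repr(replace_irregular(title)))
```

- Printing both variants side by side makes it easier for a human to pick the right one per title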
- I upgraded OpenRXV to Angular v14
## 2023-07-28
- After a bit more testing I merged the [Angular v14 changes to OpenRXV master](https://github.com/ilri/OpenRXV/pull/184)
- I am getting an error trying to import the 2020 Solr statistics from CGSpace to DSpace 7:
```console
Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=0008a7c1-e552-4a4e-93e4-4d23bf39964b] Error adding field 'workflowItemId'='0812be47-1bfe-45e2-9208-5bf10ee46f81' msg=For input string: "0812be47-1bfe-45e2-9208-5bf10ee46f81"
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:745)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:234)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:102)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:69)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:82)
at it.damore.solr.importexport.App.insertBatch(App.java:295)
at it.damore.solr.importexport.App.lambda$writeAllDocuments$10(App.java:276)
at it.damore.solr.importexport.BatchCollector.lambda$accumulator$0(BatchCollector.java:71)
at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
at it.damore.solr.importexport.App.writeAllDocuments(App.java:252)
at it.damore.solr.importexport.App.main(App.java:150)
```
- Ahhhh, in DSpace 6 this field was a string in the Solr statistics schema, but in DSpace 7 it is an integer...?
- Oh, it seems to be an Atmire change in our DSpace 6... hmmm, so we need to ignore the `workflowItemId` field when exporting
- Upstream: https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/solr/statistics/conf/schema.xml#L328
- ILRI: https://github.com/ilri/DSpace/blob/6_x-prod/dspace/solr/statistics/conf/schema.xml#L344
- I am wondering if we can skip all these workflow fields since I don't think we are using any aspects of statistics related to workflows
- I diffed our Solr statistics schema with the one from vanilla DSpace 6 and got a list of all the fields that were different:
```
isInternal,workflowItemId,containerCommunity,containerCollection,containerItem,containerBitstream,dateYear,dateYearMonth,filterquery,complete_query,simple_query,complete_query_search,simple_query_search,ngram_query_search,ngram_simplequery_search,text,storage_statistics_type,storage_size,storage_nb_of_bitstreams,name,first_name,last_name,p_communities_id,p_communities_name,p_communities_map,p_group_id,p_group_name,p_group_map,group_id,group_name,group_map,parent_count,bitstreamId,bitstreamCount,actingGroupId,actorMemberGroupId,actingGroupParentId,rangeDescription,range,version_id,file_id,cua_version,core_update_run_nb,orphaned
```
- I will combine it with the other fields I was skipping above and try the export again:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020.json -f 'time:[2020-01-01T00\:00\:00Z TO 2020-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
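- Combining the two lists by hand is error prone, so here is a small sketch of merging and de-duplicating comma-separated field lists. The helper name and the shortened sample lists are only for illustration:

```python
def merge_skip_fields(*field_lists: str) -> str:
    """Merge comma-separated Solr field lists, dropping duplicates."""
    fields = set()
    for field_list in field_lists:
        fields.update(f.strip() for f in field_list.split(",") if f.strip())
    # Case-insensitive sort, with the raw name as a tie-breaker
    return ",".join(sorted(fields, key=lambda f: (f.lower(), f)))


# Shortened stand-ins for the two lists above
earlier = "author_mtdt,containerCommunity,isArchived,geoIpCountryCode,geoipcountrycode"
schema_diff = "isInternal,workflowItemId,containerCommunity,orphaned"
print(merge_skip_fields(earlier, schema_diff))
```

- The result can be passed to the `-S` option as in the command above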
- Export a list of affiliations from the Initiatives community for Peter:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-07-28-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-07-28-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -hr \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-07-28-initiatives-affiliations.csv
```
- This is a method I first used in 2023-01 to export affiliations ONLY used in items in the Initiatives community
- I did the same for authors and investors
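- The same counting could be sketched in Python, assuming the export uses `||` to separate multiple values as in the `sed` expression above. The function name and sample rows are made up:

```python
import csv
import io
from collections import Counter


def count_values(csv_file, column: str, separator: str = "||") -> Counter:
    """Count individual values in a multi-value metadata column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for value in (row.get(column) or "").split(separator):
            value = value.strip()
            if value:
                counts[value] += 1
    return counts


# Made-up rows standing in for the exported CSV
sample = io.StringIO(
    "id,cg.contributor.affiliation[en_US]\n"
    '1,"International Livestock Research Institute||Wageningen University & Research"\n'
    "2,International Livestock Research Institute\n"
    "3,\n"
)
counts = count_values(sample, "cg.contributor.affiliation[en_US]")
for affiliation, count in counts.most_common():
    print(count, affiliation)
```

- Using the `csv` module avoids the quote stripping and re-quoting steps in the shell pipeline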
## 2023-07-29
- Export CGSpace to look for missing Initiative collection mappings
- I found a bunch of locks waiting for many hours and killed them:
```console
$ psql < locks-age.sql | awk -F"|" '$9 ~ / [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
```
- This looks for a pattern matching something like `11:30:48.598436` in the age column (not 00:00:00) and kills them
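- Note that `[[:digit:]][1-9]` does not match hour values ending in zero, so a lock that is `10:05:00` old would be skipped. A sketch of parsing the age instead, assuming pipe-separated `psql` output with the age in the ninth column and the PID in the tenth (as in the `awk` command), and made-up sample rows:

```python
import re
from datetime import timedelta

AGE_PATTERN = re.compile(
    r"^(?:(?P<days>\d+) days? )?(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})(?:\.\d+)?$"
)


def parse_age(age: str) -> timedelta:
    """Parse an interval such as '11:30:48.598436' or '1 day 02:03:04'."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError(f"unrecognized age: {age!r}")
    return timedelta(
        days=int(match["days"] or 0),
        hours=int(match["h"]),
        minutes=int(match["m"]),
        seconds=int(match["s"]),
    )


def stale_pids(lines, max_age=timedelta(hours=1), age_col=8, pid_col=9):
    """Return sorted PIDs of locks older than max_age (zero-based columns)."""
    pids = set()
    for line in lines:
        columns = [c.strip() for c in line.split("|")]
        if len(columns) <= max(age_col, pid_col):
            continue
        try:
            age = parse_age(columns[age_col])
        except ValueError:
            continue  # header, separator, or footer row
        if age > max_age and columns[pid_col].isdigit():
            pids.add(int(columns[pid_col]))
    return sorted(pids)


# Made-up rows: only the last two columns matter here
rows = [
    "a|b|c|d|e|f|g|h| 10:05:00.123456 | 1864132",
    "a|b|c|d|e|f|g|h| 00:00:03.5 | 1659487",
    "a|b|c|d|e|f|g|h| 1 day 02:03:04.5 | 1880388",
]
print(stale_pids(rows))
```

- The resulting PIDs would still have to be passed to `kill` (or `pg_terminate_backend()`) separately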
- Start a harvest on AReS
<!-- vim: set sw=2 ts=2: -->

266
content/posts/2023-08.md Normal file

@@ -0,0 +1,266 @@
---
title: "August, 2023"
date: 2023-08-03T11:18:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-08-03
- I finally got around to working on Peter's cleanups for affiliations, authors, and donors from last week
- I did some minor cleanups myself and applied them to CGSpace
- Start working on some batch uploads for IFPRI
<!--more-->
## 2023-08-04
- Minor cleanups on IFPRI's batch uploads
- I also did a duplicate check and found thirteen items that seem to be duplicates, so I sent them to Leigh to check
- I read this [interesting blog post about PostgreSQL's `log_statement` setting](https://www.endpointdev.com/blog/2012/06/logstatement-postgres-all-full-logging/)
- Someone pointed out that this also lets you take advantage of [PgBadger](https://github.com/darold/pgbadger) analysis
- I enabled statement logging on DSpace Test and I will check it in a few days
- Reading about DSpace 7 REST API again
- Here is how to get the first page of 100 items (pages are zero-indexed, so the first one is `page=0`): https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&page=0&size=100
- I really want to benchmark this to see how fast we can get all the pages
- Another thing I notice is that the bitstreams are not here, so that will be an extra call...
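- As a starting point for such a benchmark, a sketch that generates the page URLs to request, assuming zero-based page numbers (my understanding of the REST contract) and a made-up item count. The function name is hypothetical and no requests are made here:

```python
from urllib.parse import urlencode

BASE_URL = "https://dspace7test.ilri.org/server/api/discover/search/objects"


def page_urls(total_items: int, size: int = 100, first_page: int = 0):
    """Yield one search URL per page needed to cover total_items."""
    pages = -(-total_items // size)  # ceiling division
    for page in range(first_page, first_page + pages):
        params = urlencode({"dsoType": "item", "page": page, "size": size})
        yield f"{BASE_URL}?{params}"


# 250 is a made-up item count for illustration
urls = list(page_urls(250))
print(len(urls))
for url in urls:
    print(url)
```

- In practice the total number of pages would come from the `page` object in the first response rather than from a hard-coded count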
## 2023-08-05
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-08-07
- I'm checking the PostgreSQL logs now that statement logging has been enabled for a few days on DSpace Test
- I see the logs are about 7 or 8 GB, which is larger than expected—and this is the test server!
- I will now play with pgbadger to see if it gives any useful insights
- Hmm, it seems the `log_statement` advice was old, as pgBadger itself says:
> Do not enable log_statement as its log format will not be parsed by pgBadger.
... and:
> Warning: Do not enable both log_min_duration_statement, log_duration and log_statement all together, this will result in wrong counter values. Note that this will also increase drastically the size of your log. log_min_duration_statement should always be preferred.
- So we need to follow pgBadger's instructions instead to get a suitable log file
- After enabling the new settings I see that our log file is going to be reaallllly big... hmmmm will check tomorrow morning
- More work on the IFPRI batch uploads
## 2023-08-08
- Apply more corrections to authors from Peter on CGSpace
- I finally figured out a `log_line_prefix` for PostgreSQL that works for pgBadger:
```console
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
```
- Now I can generate reports:
```console
# /usr/bin/pgbadger -I -q /var/log/postgresql/postgresql-14-main.log -O /srv/www/pgbadger
```
- Ideally we would run this incremental report every day on the postgresql-14-main.log.1 aka yesterday's version of the log file after it is rotated
- Now I have to see how large the file will be...
- I did some final updates to the ninety IFPRI records and uploaded them to DSpace Test first, then to CGSpace
## 2023-08-11
- Fix bug with header background on DSpace 7 on mobile
## 2023-08-12
- Export CGSpace to check for missing Initiative collection mappings
- I deployed the latest OpenRXV master branch with Angular v14 and backend updates on the server
- Start a harvest on AReS
## 2023-08-14
- I ported the DSpace 6.x REST API patch to allow specifying a bundle name when POSTing a bitstream to the legacy REST API in DSpace 7.6
## 2023-08-16
- I noticed that the DSpace statistics pages don't seem to work on communities or collections
- I finally took time to look in the DSpace log file and found this for one:
```console
2023-08-16 14:30:31,873 WARN dace8f96-f034-488e-b38c-9f2eb5d0e002 6cbd0b18-6852-4294-99a5-02dfcab0a469 org.dspace.app.rest.exception.DSpaceApiExceptionControllerAdvice @ Request is invalid or incorrect (status:400 exception: Invalid UUID string: -1 at: java.base/java.util.UUID.fromString1(UUID.java:280))
```
- I'm surprised to see this because those should have been dealt with when we upgraded to DSpace 6
- Looking in the Solr statistics core I see ~1,000,000 documents with the ID `-1`, and about 57,000,000 that don't
- Also interesting, faceting by `dateYear` I see:
- 2023: 209566
- 2022: 403871
- 2021: 336548
- 2020: 31659
- ... none before 2020
- They are all type 5, which is "Site" aka the home page, according to `dspace-api/src/main/java/org/dspace/core/Constants.java`
- Ah hah, and I can see in my DSpace 7 test Solr there are a bunch of hits with `type: 5` that have "-1" of course, but also newer ones that have an actual UUID
- I used the `/server/api/dso/find?uuid=3945ec23-2426-4fce-a2ea-48b38b91547f` endpoint to find out that there is a new `/server/api/core/sites` endpoint listing exactly one site (the home page) with this ID
- So for now I can replace all the "-1" documents with this ID on the test server at least, then I will have to remember to do that during the migration of the production instance
- I did a new export from DSpace 6 using solr-import-export-json with a query limiting it to documents of type 5 and negative 1 ID:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-fix-uuid.json -f 'id:\-1 AND type:5 AND time:[2020-01-01T00\:00\:00Z TO 2023-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
- Then I replaced the IDs with the UUID of the site homepage on DSpace 7 Test:
```console
$ sed -i 's/"id":"-1"/"id":"3945ec23-2426-4fce-a2ea-48b38b91547f"/' /tmp/statistics-fix-uuid.json
```
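- A slightly more careful alternative to the `sed` replacement, sketched under the assumption that the export contains one JSON document per line (I have not verified the export format beyond that). It only touches documents that are both `type` 5 and have the "-1" ID; the function name and sample lines are made up:

```python
import json

SITE_UUID = "3945ec23-2426-4fce-a2ea-48b38b91547f"


def fix_site_ids(lines, site_uuid: str = SITE_UUID):
    """Yield JSON lines with "-1" IDs on type 5 (site) documents replaced."""
    for line in lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        # The type may be serialized as a number or a string
        if doc.get("id") == "-1" and str(doc.get("type")) == "5":
            doc["id"] = site_uuid
        yield json.dumps(doc, ensure_ascii=False)


# Made-up documents for illustration
sample_lines = [
    '{"uid": "a1", "id": "-1", "type": 5}',
    '{"uid": "b2", "id": "0008a7c1-e552-4a4e-93e4-4d23bf39964b", "type": 2}',
]
for fixed in fix_site_ids(sample_lines):
    print(fixed)
```

- Parsing the JSON avoids accidentally rewriting a `"id":"-1"` string that appears inside some other field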
- I re-imported those records and I no longer see the "-1" IDs, but still get the same error in the log
- I don't understand, maybe there is some voodoo, so I rebooted the server
- Hmm, no, it's not a voodoo cache issue, so I really need to debug this:
```console
2023-08-16 15:44:07,122 WARN dace8f96-f034-488e-b38c-9f2eb5d0e002 036b88e6-7548-4852-9646-f345ce3bfcc2 org.dspace.app.rest.exception.DSpaceApiExceptionControllerAdvice @ Request is invalid or incorrect (status:400 exception: Invalid UUID string: -1 at: java.base/java.util.UUID.fromString1(UUID.java:280))
```
- On a related note, I figured out that the root site already has a UUID in DSpace 6, and it's exactly the one above (3945ec23-2426-4fce-a2ea-48b38b91547f)
- I noticed it while looking at the [DSpace 6 REST API's hierarchy page](https://cgspace.cgiar.org/rest/hierarchy)
- So I can update these "-1" IDs with "type:5" in our production I think...
## 2023-08-17
- I decided to update the "-1" IDs in Solr on DSpace 6
- Unfortunately, in Solr there is no way to update only documents matching a query, so we have to export and re-import
- I exported all documents with "type:5" (Homepage) and replaced the ID in the JSON:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-fix-uuid.json -f 'type:5' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
$ sed -i 's/"id":"-1"/"id":"3945ec23-2426-4fce-a2ea-48b38b91547f"/' /tmp/statistics-fix-uuid.json
```
- (Oops, skipping the fields above was not necessary, since I'm importing back into DSpace 6 where those fields exist)
- Then I re-imported:
```console
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-fix-uuid.json -k uid
```
- This worked, but I still see new records coming in that have "id:-1" so I will need to repeat this during the migration.
- I also notice many stats records that have erroneous cities:
- `"city":"com.maxmind.geoip2.record.City [ {} ]"`
- `"city":"com.maxmind.geoip2.record.City [ {\"geoname_id\":1002145,\"names\":{\"de\":\"George\",\"en\":\"George\",\"ru\":\"Джордж\",\"fr\":\"George\",\"ja\":\"ジョージ\"}} ]"`
## 2023-08-18
- Export CGSpace to check for missing Initiative collection mappings
## 2023-08-19
- Start a harvest on AReS
## 2023-08-21
- Experiment with the DSpace 7 REST API
- I wrote a Python script to benchmark harvesting all 100,000+ items using the `/api/discover/search/objects` endpoint 100 items at a time
- I was able to harvest the entire 106,000 items in fifty-two minutes, which seems slow, but that's about ten times faster than with the legacy REST API...
- Still, I need to benchmark a bit more, as the item response doesn't include collection mappings or thumbnails
- Reading the [API docs](https://github.com/DSpace/RestContract/blob/main/README.md#etags--conditional-headers) it seems that we should be able to use the standard `If-Modified-Since` header for some endpoints
- I tried it on the `/api/discover/search/objects` and `/api/core/items` endpoints, but apparently those don't support this header because I don't see a `Last-Modified` header in the response
- According to the docs, it means that these endpoints indeed don't support it...
## 2023-08-22
- I was experimenting with the DSpace 7 REST API again
- This time looking at the thumbnail responses in item endpoints
- According to [the documentation](https://github.com/DSpace/RestContract/blob/main/items.md#main-thumbnail) the API will respond with HTTP 200 if there is a thumbnail, and HTTP 204 if there is no content
- That means we need to make the request before we can even find out!
- Tim on DSpace Slack pointed out the DSpace 7 REST API's [projections](https://github.com/DSpace/RestContract/blob/main/projections.md)
- This means we can embed resources like thumbnail and owningCollection in the item (and other) requests, for example: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&embed=thumbnail,owningCollection
## 2023-08-23
- I benchmarked the DSpace 7 REST API with the new embeds and it took four hours and seventeen minutes to get all 106,000 items on DSpace 7 Test
- So this is much slower than the results I saw earlier this week, but maybe slightly faster than DSpace 6?
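- To put the two benchmark runs side by side, a quick throughput calculation using only the figures from these notes (106,000 items in fifty-two minutes without embeds, and in four hours and seventeen minutes with them):

```python
def items_per_second(items: int, hours: int = 0, minutes: int = 0) -> float:
    """Average harvest throughput for a run of the given duration."""
    return items / (hours * 3600 + minutes * 60)


plain = items_per_second(106_000, minutes=52)
embedded = items_per_second(106_000, hours=4, minutes=17)
print(f"without embeds: {plain:.1f} items/s")
print(f"with embeds: {embedded:.1f} items/s")
print(f"slowdown: {plain / embedded:.1f}x")
```

- So embedding the thumbnail and owning collection makes the harvest roughly five times slower per item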
- Maria from Alliance contacted me to say they have agreed to use UN M.49 regions more strictly in TIP, so they want to replace our non-standard "Latin America" region with "Latin America and the Caribbean", "Caribbean" and "Americas" on all Alliance outputs
- I exported their community on CGSpace and fixed the metadata in OpenRefine
- I tried to run `dspace cleanup -v` on CGSpace, but got this error:
```
Caused by: org.postgresql.util.PSQLException: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (uuid)=(61bff7da-c8e3-420f-841c-ec5e8238d716) is still referenced from table "bundle".
```
- The solution, as always, is to delete those IDs manually in PostgreSQL:
```console
$ psql -d dspace -c "UPDATE bundle SET primary_bitstream_id=NULL WHERE primary_bitstream_id IN ('61bff7da-c8e3-420f-841c-ec5e8238d716');"
UPDATE 1
```
- I also tried to delete all users who haven't logged in since 2017 using the groomer script, but it crashes due to those users still having items or workflows or whatever:
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 08/23/2017 -d
```
- I see that it is now [possible in DSpace 7 to delete such users](https://github.com/DSpace/DSpace/pull/2229) so we will have to wait
## 2023-08-24
- I spent some time trying to get themes to extend in DSpace 7
- I finally got a basic ILRI theme working, but there is a bug that causes theme components to get duplicated
## 2023-08-25
- Meeting with Altmetric about the next phase of their integration with CGSpace
- A bit of cleanup on CGSpace metadata
- I fixed DOIs, licenses, dates, subjects, affiliations, titles, publishers, and types in 1,240 items
## 2023-08-26
- A few weeks ago we received a request from the Fruits and Vegetables Initiative saying that they've gotten approval to begin using the long name instead of the short one everywhere, apparently for SEO reasons
- After communicating with PRMS and other teams working on systems using this metadata I finally updated them in CGSpace
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- I fixed ~200 titles with new lines, excessive whitespace, and Unicode FFFD characters
- There are many more with 00A0, 200B, etc, but those need more careful inspection
## 2023-08-28
- Day one of CGSpace partners meeting in Addis
- Oh this is a game changer, I just realized that we can use Solr query syntax in the DSpace 7 REST API, so we can do this for example:
```
https://dspace7test.ilri.org/server/api/discover/search/objects?query=lastModified%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
```
- Which is this query: `lastModified:[2023-08-01T00:00:00Z TO *]`
- The queries need to be URL encoded of course
- Oh nice, and we can do the same for accession date:
```
https://dspace7test.ilri.org/server/api/discover/search/objects?query=dc.date.accessioned_dt%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
```
- That is this query: `dc.date.accessioned_dt:[2023-08-01T00:00:00Z TO *]`
- We need to use the dt version of the accession date because that is the one that has a date type
- This query gives 290 results, which should be the items submitted in August!
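A minimal sketch of building these URL-encoded queries in Python with only the standard library. The helper name `discover_url` is hypothetical; the endpoint and Solr field names are the ones from the examples above.

```python
from urllib.parse import quote

# Discovery search endpoint used in the examples above
BASE = "https://dspace7test.ilri.org/server/api/discover/search/objects"


def discover_url(solr_query: str) -> str:
    """Return a search URL with the Solr query percent-encoded."""
    # safe="" makes quote() encode ":", "[", "]", "*" and spaces as well
    return f"{BASE}?query={quote(solr_query, safe='')}"


print(discover_url("lastModified:[2023-08-01T00:00:00Z TO *]"))
print(discover_url("dc.date.accessioned_dt:[2023-08-01T00:00:00Z TO *]"))
```

Keeping the readable Solr query in the code and encoding it at the last moment makes the date range easier to change than editing the encoded URL by hand.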
## 2023-08-29
- Day two of CGSpace partners meeting in Addis
## 2023-08-30
- Day three of CGSpace partners meeting in Addis
- I did a lot of work on the CGSpace Angular theme for DSpace 7
- Many changes to Discovery filters and search results
## 2023-08-31
- Day four of CGSpace partners meeting in Addis
- I removed the old Bioversity and CIAT subjects from Discovery facets on CGSpace
- Maria and Leroy said they are no longer using them so we don't need to keep indexing and displaying them
- I did a lot of work on the CGSpace Angular theme for DSpace 7
- Now we have clickable keywords that go to Discovery instead of browse, as well as some new icons
- We don't need to use the clunky browse links to get clickable links any more so I will disable those
<!-- vim: set sw=2 ts=2: -->

243
content/posts/2023-09.md Normal file

@@ -0,0 +1,243 @@
---
title: "September, 2023"
date: 2023-09-02T17:29:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-09-02
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
<!--more-->
## 2023-09-03
- I figured out how to use Altmetric and Dimensions badges in the DSpace Angular frontend
- It still feels hacky, but using [AfterViewInit](https://stackoverflow.com/questions/41936631/how-to-trigger-the-function-after-dom-markup-is-loaded-in-angular-style-applicat), and importing the Altmetric `embed.js` in the component works
- The style on mobile also needs work...
## 2023-09-06
- Discussion with Marie about finalizing the output types list on GitHub
- I did some review and cleanup in preparation for publishing the new list
## 2023-09-07
- Export CGSpace to start doing a review of the metadata
- First I will start by extracting all items with DOIs, along with some fields I can compare against Crossref:
```console
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv \
| csvcut -c 'id,dc.title[en_US],dcterms.issued[en_US],dcterms.available[en_US],cg.issn[en_US],cg.isbn[en_US],cg.volume[en_US],cg.issue[en_US],cg.number[en_US],dcterms.extent[en_US],cg.identifier.doi[en_US],cg.reviewStatus[en_US],cg.isijournal[en_US],dcterms.license[en_US],dcterms.accessRights[en_US],dcterms.type[en_US],dc.identifier.uri[en_US]' \
> /tmp/2023-09-07-cgspace-dois.csv
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv | csvcut -c 'cg.identifier.doi[en_US]' | sed 1d > /tmp/2023-09-07-cgspace-dois.txt
```
- Then I resolved the DOIs from Crossref:
```console
$ ./ilri/crossref_doi_lookup.py -i /tmp/2023-09-07-cgspace-dois.txt -o /tmp/2023-09-07-cgspace-dois-results.csv -e a.orth@cgiar.org
```
- A user emailed to ask about uploading a 180MB PDF to CGSpace
- I used GhostScript to try reducing it using the `screen`, `ebook` and `prepress` presets:
```console
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-screen.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-ebook.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-prepress.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
```
- The `prepress` one is 300DPI and looks visually identical to the original, so I proposed that we use that one
## 2023-09-08
- I did a review of the metadata for our items with DOIs, comparing with data from Crossref
- I spot checked a handful of issue / online dates and licenses, and saw that Crossref's dates are always more accurate than ours when they differ
- I also filled in some missing volumes, issues, ISSNs, and extents
- This results in 14,000 changes to existing items, which will take several days to import unfortunately
- After eight hours the first file is only about 2/3 finished... sigh
- Meet with Peter to discuss changes to the DSpace 7 test
- Minor updates to submission forms and some new ideas for the home page and item page
- I figured out how to use a themed home page component and add a cards UI to our CGSpace theme
## 2023-09-09
- I can't believe that almost 18 hours later the first CSV import with 5,000 changes is not done...
- Run all system updates on CGSpace and reboot it, as it had been two months since the last time
## 2023-09-10
- Minor work on the DSpace 7 home page
## 2023-09-11
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-09-12
- Minor work on DSpace 7 home page
- Minor work on CG Core types
- I published a new HTML version of the updated IPtypes and archived the current version as v2.0.0 so we can still reference it
## 2023-09-13
- Stefano reminded me about the updated OAI MODS mappings on CGSpace so I re-applied them on DSpace Test and updated the OAI index so he could confirm
- Now I'm ready to put it on CGSpace if he confirms
- I created a basic theme for CIP on DSpace 7
- While doing that I noticed that a bunch of CIP bitstreams didn't have the latest 500px thumbnails so I re-ran filter-media on a handful of their collections
- I had two occurrences of an OOM kill of the Tomcat 9 java process on DSpace 7 test tonight
- Once while doing a Discovery index, the other while doing filter media
## 2023-09-15
- Discuss issues with the Altmetric API with the Altmetric support team
- Apparently we can use a different API, the [Explorer API](https://www.altmetric.com/explorer/documentation/api), since we already have access to the Explorer dashboard
- I reduced the Solr heap size on DSpace 7 from 3GB to 2GB
- Apparently I already did this from 4GB to 3GB a few months ago
- The Solr admin interface was showing Solr taking ~1GB of RAM so I think this should be safe
- Mark on DSpace Slack said he uses PM2's `--max-memory-restart` so the processes restart when they hit the limit
- Also, he said he had to reduce `cache:serverSide:botCache:max` from 1000 to 500 to cache fewer SSR pages in memory
- I decided to try deploying DSpace 7 Test on a Hetzner server with 64GB RAM, 6 CPUs, and 2x512GB NVMe SSD
## 2023-09-16
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- Configure the privacy policy page on DSpace 7 using a themed component with the text from our DSpace 6 site
- I realized that for all my custom Angular components I should be using `routerLink` instead of `href` when I am constructing links
- The `routerLink` routes within the single page application and saves state, while the `href` reloads the page
- Using the `routerLink` way is faster and results in less flashing and jumping in the page when navigating
- See: https://stackoverflow.com/a/61588147
## 2023-09-17
- I added an About page to DSpace 7 Test using similar logic to the privacy page
## 2023-09-18
- I filed a GitHub issue for being unable to navigate dropdown lists using the keyboard on the dspace-angular submission form: https://github.com/DSpace/dspace-angular/issues/2500
- I filed a GitHub issue for the search filters capitalizing metadata values: https://github.com/DSpace/dspace-angular/issues/2501
## 2023-09-19
- Complete migration of DSpace 7 Test from Linode to Hetzner
- Export some years of Solr stats from CGSpace to import on the new DSpace 7 Test:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020-2022.json -f 'time:[2020-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
- Ben sent me an export of ILRI presentations from Slideshare and asked if we could see if any are missing on CGSpace
- First I exported CGSpace and extracted the `cg.identifier.url` column so I could normalize all Slideshare URLs to use "https://www.slideshare.net" instead of localized variants (es.slideshare.net, fr.slideshare.net, etc) as well as non-https links and links with query params and slashes at the end
- This was about 250 URLs
- I extracted the URL field from both our list and the Slideshare list and then used [GNU `join` to print non-matched lines](https://unix.stackexchange.com/questions/274548/join-two-files-each-with-two-columns-including-non-matching-lines):
```console
$ join -t, -v 2 -11 -21 -o auto /tmp/cgspace-ilri-slideshare-sorted-only-urls-sorted.csv /tmp/ilri-slideshare-sorted-sorted.csv | wc -l
542
```
- Important to note that you must use GNU `sort` on the files first, as I had tried sorting in vim and it didn't satisfy `join`
- So it seems there are 542 Slideshare presentations we are missing
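The normalization and the `join -v 2` step above can also be sketched in Python. This is only an illustration, not the commands I actually ran: the function names are hypothetical and the regular expression is an assumption about what the localized Slideshare hosts look like.

```python
import re


def normalize_slideshare(url: str) -> str:
    """Normalize a Slideshare URL to the https://www.slideshare.net form."""
    # Drop surrounding whitespace and any query parameters
    url = url.strip().split("?", 1)[0]
    # Force https and the www host instead of es., fr., or a bare domain
    url = re.sub(
        r"^https?://([a-z]+\.)?slideshare\.net",
        "https://www.slideshare.net",
        url,
    )
    # Drop slashes at the end
    return url.rstrip("/")


def missing_urls(ours: list[str], theirs: list[str]) -> list[str]:
    """Return normalized URLs in theirs that are not in ours (like join -v 2)."""
    known = {normalize_slideshare(u) for u in ours}
    return sorted({normalize_slideshare(u) for u in theirs} - known)
```

Because this compares sets, the input does not need to be sorted first, which avoids the `sort` versus vim problem with `join`.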
## 2023-09-20
- Regarding the incorrect city in Solr statistics, I see we have 1,600,000 of them
- Before filing a GitHub issue, I want to check if they maybe come from an Atmire module, as I see them clustered around two particular CUA versions:
```json
{
"responseHeader": {
"status": 0,
"QTime": 2760,
"params": {
"q": "city:com.maxmind.geoip2.record.City*",
"facet.field": "cua_version",
"indent": "true",
"rows": "0",
"wt": "json",
"facet": "true",
"_": "1695192301927"
}
},
"response": {
"numFound": 1661863,
"start": 0,
"docs": []
},
"facet_counts": {
"facet_queries": {},
"facet_fields": {
"cua_version": [
"6.x-4.1.10-ilri-RC7",
1112186,
"6.x-4.1.10-ilri-RC5",
451180,
"6.x-4.1.10-ilri-RC9",
0
]
},
"facet_dates": {},
"facet_ranges": {},
"facet_intervals": {}
}
}
```
- I migrated AReS from Linode to Hetzner
- I asked on Slack and someone told me that we need to edit `src/app/menu.resolver.ts` to add new drop down menus to the top navbar
- It works, though it is unfortunate that we can't do it in a theme
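Regarding the Solr facet response for the MaxMind city objects earlier in this entry: `facet_fields` comes back as a flat list of alternating values and counts. A small sketch, with the hypothetical helper `facet_pairs`, shows how to pair them up and compare them with `numFound`. The numbers are copied from the response above.

```python
def facet_pairs(flat):
    """Pair up Solr's flat [value, count, value, count, ...] facet list."""
    return dict(zip(flat[0::2], flat[1::2]))


# Values copied from the facet response in this entry
cua_version = [
    "6.x-4.1.10-ilri-RC7", 1112186,
    "6.x-4.1.10-ilri-RC5", 451180,
    "6.x-4.1.10-ilri-RC9", 0,
]
num_found = 1661863

counts = facet_pairs(cua_version)
# Matching documents that are not covered by any listed CUA version
unaccounted = num_found - sum(counts.values())

for version, count in counts.items():
    print(f"{version}: {count} ({count / num_found:.1%})")
print(f"not covered by a listed cua_version: {unaccounted}")
```

The remainder is presumably documents with no `cua_version` value at all, but that is an assumption that would need a separate query to confirm.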
## 2023-09-21
- More minor work on DSpace 7 home page and menus
- Meeting to discuss types and DSpace 7 migration plans
- Create a DSpace 7 theme for IITA
## 2023-09-22
- Create a DSpace 7 theme for IWMI
- I had some issues with pm2 on the new DSpace 7 Test
- It seems to be due to mixing systemd starting versus manually starting / stopping...
- After reading the discussion in [this pm2 issue](https://github.com/Unitech/pm2/issues/2914) I realize that we probably need to use `--no-daemon` to have systemd fully manage the processes without pm2 trying to save state
## 2023-09-23
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-09-25
- CGSpace metadata and community / collection cleanup
- Review some patches on DSpace Angular
- Create a basic Alliance theme for DSpace 7
## 2023-09-27
- I realized that we can get controlled vocabularies from DSpace 7's REST API, for both value-pairs and hierarchical controlled vocabularies, ie:
https://dspace7test.ilri.org/server/api/submission/vocabularies/common_iso_languages/entries
## 2023-09-29
- Meeting with Aditi and others to discuss plan for using CGSpace to do a systematic review of CGIAR research on climate change
- I cleaned up metadata for a hundred or so items, and realized we will need to do more to make sure abstracts and open access status are correct since there will be a laser focus on the metadata
## 2023-09-30
- Export CGSpace to check for missing Initiative collection mappings
- Still working on checking Unpaywall for access rights and licenses for our DOIs
- Regarding Unpaywall's "evidence" metadata about whether an item is open access or not, after looking at dozens of items manually:
- evidence: "oa journal (via doaj)" <---- yes
- evidence: "open (via free article)" <---- hmmm, not always correct
- evidence: "open (via page says license)" <--- noooo, can't rely on that
- evidence: "open (via page says Open Access)" <---- yes...?
- evidence: "open (via free pdf)" <---- hmmm, not always correct
- evidence: "oa journal (via publisher name)" <---- noooo
- I updated access status for about four hundred more items based on this, and licenses for a dozen or so
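A rough sketch of how the manual findings about Unpaywall's evidence strings could be encoded as a lookup table. The trust labels and the function name are hypothetical, chosen only to mirror my notes above. They are not part of Unpaywall's API.

```python
from typing import Optional

# Trust levels based on the manual review above; the labels are my own
EVIDENCE_TRUST = {
    "oa journal (via doaj)": "trusted",
    "open (via page says Open Access)": "probably",
    "open (via free article)": "check manually",
    "open (via free pdf)": "check manually",
    "open (via page says license)": "not trusted",
    "oa journal (via publisher name)": "not trusted",
}


def trust_for(evidence: Optional[str]) -> str:
    """Return the trust label for an evidence string, or 'unknown'."""
    return EVIDENCE_TRUST.get(evidence or "", "unknown")


print(trust_for("oa journal (via doaj)"))
print(trust_for("open (via page says license)"))
```

Anything that is not explicitly listed falls through to `unknown`, so new evidence strings get reviewed by hand instead of being silently accepted.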
<!-- vim: set sw=2 ts=2: -->

150
content/posts/2023-10.md Normal file

@@ -0,0 +1,150 @@
---
title: "October, 2023"
date: 2023-10-02T09:05:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-10-02
- Export CGSpace to check DOIs against Crossref
- I found that [Crossref's metadata is in the public domain under the CC0 license](https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/)
- One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
- We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
<!--more-->
- This GREL extracts the _text_ content of the `<jats:p>` tags (ie, no other JATS XML markup tags like `<jats:i>`, `<jats:sub>`, etc):
```console
forEach(value.parseXml().select("jats|p"),i,i.xmlText()).join("")
```
- Note that we need to use `select("jats|p")` instead of `select("jats:p")` for OpenRefine's parseXml, and we need to `join()` on the end
- I updated metadata for about 3,000 items using Crossref metadata
- I stripped trailing periods for titles where they were missing on the Crossref titles
- I copied abstracts for about 600 items that were missing them, for items that were Creative Commons
- I updated publishers for a few thousand more where ours and Crossref disagreed, checking a handful manually first
- I also added subjects to the `crossref_doi_lookup.py` script to see if they will be useful for us
- When checking with csv-metadata-quality I can validate those subjects against AGROVOC and add them if they are valid
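For reference, the GREL expression for the `<jats:p>` tags earlier in this entry could be approximated outside OpenRefine with Python's standard library. This is a sketch under assumptions: the namespace URI bound to the `jats:` prefix is my assumption, and the abstract is wrapped in a hypothetical `<root>` element because the fragments arrive without a namespace declaration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the jats: prefix
JATS_NS = "http://www.ncbi.nlm.nih.gov/JATS1"


def jats_abstract_text(abstract: str) -> str:
    """Return the text of all <jats:p> elements, without inner markup."""
    # Wrap the fragment so the prefix is declared and there is a single root
    root = ET.fromstring(f'<root xmlns:jats="{JATS_NS}">{abstract}</root>')
    paragraphs = root.iter(f"{{{JATS_NS}}}p")
    # itertext() yields the text of an element and all of its children
    return "".join("".join(p.itertext()) for p in paragraphs)


sample = "<jats:title>Abstract</jats:title><jats:p>Milk is <jats:i>good</jats:i>.</jats:p>"
print(jats_abstract_text(sample))
```

Like the GREL version, this skips elements outside the paragraphs (such as `<jats:title>`) and keeps only the text of inline tags like `<jats:i>`.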
## 2023-10-03
- I added the item type to the collection subscription email on DSpace 6
- It's done differently on DSpace 7 so I'll have to see how to do it there...
- Test a patch that fixes a bug with item versioning disabled in DSpace 7
- I hadn't realized that DSpace 7 defaulted to versioning being enabled, whereas we never used this in DSpace 6 (yet)
- Submit [an issue regarding duplicate Discovery sort fields](https://github.com/DSpace/DSpace/issues/9104) in DSpace 7
## 2023-10-05
- Some discussion this week about issue and online dates for journal articles, with regards to PRMS
- I looked more closely at the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and realized (again) that their "issue" date is not the same as our issue date—they take the earlier of the print and online dates!
- Also, *very many* items have no print date at all, perhaps due to delays, errors, or simply because the journal is "online only"!
- I suggested again that PRMS should consider both, take the earlier of the two, and then check whether that date is in the current reporting period
- I managed to find 80 items with print publishing dates from 2023 and updated those from Crossref, but for the rest we will have to think about how we handle them
## 2023-10-06
- More discussion about dates after looking closely at them yesterday and today
- Crossref doesn't always have both issued and online dates—sometimes they have one, sometimes the other, and sometimes both, so we cannot rely on them 100% for that
- In some cases, the item is available online for months (or even a year!), but has not been included in an issue yet, and thus has no "issue" date, for example:
- https://doi.org/10.1002/csc2.20914 <--- published online January 2023!
- https://doi.org/10.1111/mcn.13401 <--- published online July 2022!
- Even journals make mistakes: this journal article was "issued" in 2022, but online in 2023! This is not Crossref's fault, but the journal's!
- https://doi.org/10.1186/s40066-022-00400-6
- I found a bunch more strange cases regarding dates and recommended to PRMS team that they use the earlier of the issued and online dates
- Meet with Aditi to start discussing the scope of knowledge products we can get for the CGIAR climate change synthesis
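The "earlier of the issued and online dates" recommendation can be sketched as follows. This assumes dates arrive as Crossref-style date parts such as `[2023, 1]`. The function names are hypothetical, and padding a missing month or day with 1 is a design choice of this sketch, not something PRMS or Crossref prescribes.

```python
from datetime import date
from typing import Optional


def to_date(parts: Optional[list]) -> Optional[date]:
    """Convert date parts like [2023], [2023, 1] or [2023, 1, 15] to a date."""
    if not parts:
        return None
    # Pad a missing month and/or day with 1
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def earliest(issued: Optional[list], online: Optional[list]) -> Optional[date]:
    """Return the earlier of the two dates, ignoring whichever is missing."""
    candidates = [d for d in (to_date(issued), to_date(online)) if d is not None]
    return min(candidates) if candidates else None


# An item that is online but not yet in an issue still gets a usable date
print(earliest(None, [2022, 7]))
```

Returning `None` when both dates are missing keeps those items visible for manual follow-up.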
## 2023-10-07
- I spent a few hours (!) debugging an issue in Python when downloading PDFs
- I think it ended up being due to `requests_cache`!!! Grrrr
- On a positive note I've greatly refactored my script for discovering and downloading PDFs from Unpaywall
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-10-08
- Starting to see some stuck locks on CGSpace this morning
- I will give notice and restart CGSpace
- Work on Python script to harvest DSpace REST API and save to CSV
## 2023-10-11
- File an issue on the DSpace issue tracker regarding the MaxMind JSON objects in our Solr statistics: https://github.com/DSpace/DSpace/issues/9118
## 2023-10-12
- Discuss MODS issues in CGSpace's OAI-PMH with Stefano and Valentina
- AGRIS can currently only support MODS 3.7 so they need us to roll our 3.8 work from 2023-06 back down, which requires some minor changes to the crosswalk
## 2023-10-13
- I did some more minor work to get the MODS 3.7 changes ready for AGRIS on DSpace Test
## 2023-10-14
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- I deployed the AGRIS changes for OAI-PMH on CGSpace
## 2023-10-16
- Fix some typos in ILRI subjects on CGSpace
- These were affecting the taxonomy on ilri.org
- I exported CGSpace and did some validation and cleanup on ILRI subjects, moving some to AGROVOC subjects
- Port the MODS 3.7 crosswalk from DSpace 6 to DSpace 7
- It works fine, we only need to take note that the OAI-PMH endpoint is now relative to the `/server` path instead of a dedicated OAI path
## 2023-10-17
- Export CGSpace to do some cleanups all over on invalid metadata values
- I found many metadata values in the wrong field, wrong format, etc
- This ended up being cleanups for 694 items
## 2023-10-20
- Export CGSpace to check for missing Initiative collection mappings
- I also did a run of looking up all Initiative outputs with DOIs against Crossref to check for missing dates, publishers, etc
- I found issued dates for a few, and online dates for over 100
- I also fixed some incorrect licenses, access status, and abstracts
## 2023-10-23
- Export a list of Internal Documents for Peter to review to see if we can re-classify some
- Peter sent changes for 740 items so I applied them on CGSpace
- Testing the changes for OpenRXV DSpace 7 compatibility
## 2023-10-24
- Sync DSpace 7 Test with a fresh CGSpace snapshot
- Meeting with FARA to discuss DSpace training and support
- Meeting with IFPRI about migrating to CGSpace
## 2023-10-25
- Maria was asking about an error deleting an item in the Alliance community
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:..."
- According to my notes this error happened a few times in the past and is some kind of corner case regarding permissions
- I deleted the item for her
- I deleted a handful of old CRP groups on CGSpace
## 2023-10-27
- Peter sent me a list of journal articles from Altmetric that have an ILRI affiliation, but no Handle
- I used my `crossref_doi_lookup.py` script to fetch the metadata for them using their DOIs, then did a bunch of cleanup in OpenRefine
- Test some LDAP patches for DSpace 7
## 2023-10-30
- Some work on metadata for Aditi's review
- I found more preprints grrrr
## 2023-10-31
- Peter got back to me with the cleanups on ILRI journal articles from Altmetric that we didn't have on CGSpace
- I did another duplicate check and found four more duplicates that had been uploaded yesterday
- Then I did a quick sanity check and uploaded the remaining 19 items to CGSpace
<!-- vim: set sw=2 ts=2: -->

215
content/posts/2023-11.md Normal file

@@ -0,0 +1,215 @@
---
title: "November, 2023"
date: 2023-11-02T12:59:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-11-01
- Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
- I improved the filtering and wrote some Python using pandas to merge my sources more reliably
## 2023-11-02
- Export CGSpace to check missing Initiative collection mappings
- Start a harvest on AReS
<!--more-->
- IFPRI contacted us about importing their Slideshare presentations to CGSpace
- There are ~1,700 of them, dating back as early as 2008
- I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test
## 2023-11-03
- A little bit of work on the CGIAR Climate Change Synthesis
- Discuss some CGSpace migration plans with Leigh from IFPRI
- For their Slideshare content we agreed:
- Exclude private
- Exclude deleted
- Exclude non presentation types
- Exclude duplicates within the collection for now until we can sort them out
- That leaves about 1,500 items out of the 1,700
- I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those
## 2023-11-04
- Export CGSpace to check for missing Initiative collection mappings
- I ran through the list of potential duplicates on the IFPRI Slideshare presentations
## 2023-11-05
- Work with Salem to migrate AReS to the new version
## 2023-11-07
- DSpace 7 Test went down and there is very high load on the server
- I saw very high load from Java but didn't have time to check exactly what was wrong so I just rebooted the host
- A few hours after the restart the system went down again, with very high load from Java again
- I see lots of messages like this in the Tomcat log:
```
tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
```
- I see some messages in `dspace.log` about heap space:
```
Caused by: java.lang.OutOfMemoryError: Java heap space
```
- I will increase Tomcat's heap from 4096m to 5120m
- A few hours later it happened again, so I increased the heap from 5120m to 6144m
- Not sure what's going on today...
- I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
```console
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
$ dspace index-discovery -r 10947/2516
$ dspace index-discovery -r 10947/2515
$ dspace index-discovery -r 10568/83389
$ dspace index-discovery
```
- I think this is the minimum we can do to avoid a full Discovery reindex, which is very expensive
- I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode as I had done before in [September, 2023]({{< relref "2023-09.md" >}})
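Regarding the G1 "Pause Full" lines from the Tomcat log earlier in this entry: a quick sketch of parsing them shows how little memory each full GC frees. The regular expression is an assumption based only on the one log line quoted above, and the function name is hypothetical.

```python
import re

# Matches e.g. "Pause Full (G1 Compaction Pause) 4085M->4080M(4096M)"
GC_RE = re.compile(r"Pause Full .*? (\d+)M->(\d+)M\((\d+)M\)")


def heap_after_full_gc(line: str):
    """Return the fraction of the heap still in use after a full GC, or None."""
    match = GC_RE.search(line)
    if not match:
        return None
    _before, after, total = map(int, match.groups())
    return after / total


line = (
    "tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full "
    "(G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
)
print(f"{heap_after_full_gc(line):.1%} of heap still used after full GC")
```

A value this close to 100% after a full collection means the heap is full of live objects, which is consistent with the `OutOfMemoryError` messages in `dspace.log`.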
## 2023-11-08
- DSpace 7 Test has very high load again and I see more Java heap space errors in the log
```console
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07
35
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
7
```
- I don't know what is happening... I will increase the heap size from 6144m to 7168m again...
- I did some work on the value mappings in AReS
- I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine
- Importing creates duplicate records, so I deleted and re-created the index in Elasticsearch first
- Then I started a new harvest on AReS to make sure the mappings are applied
## 2023-11-09
- Ryan asked me for help uploading a large PDF to CGSpace
- I tried my usual GhostScript prepress invocation and found the size decreased significantly, but some minor artifacts appeared in the images
- Interestingly, the [GhostScript docs](https://ghostscript.com/docs/9.54.0/VectorDevices.htm) mention that `prepress` doesn't give the best results:
> Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).
- Also, I found [a question on StackOverflow discussing some further techniques for PDFs with images](https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality):
```console
$ gs -sOutputFile=137166-default-dct.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -c "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams" -f 137166.pdf
```
- This looks much better, and is still much smaller than the original
- Also, I used `pdfimages` to extract all the images from the original and the one above and found:
```console
$ du -sh images-*
886M images-default-dct
1012M images-original
```
- And from [WeCompress's analysis](https://www.wecompress.com/en/analyze) I see that the images are 85% of the size of the PDF
## 2023-11-10
- I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace
## 2023-11-11
- Salem fixed a bug on OpenRXV that was splitting country values by "," before matching them with ISO countries
- I exported CGSpace to check for missing Initiative collection mappings
- Start a fresh harvest on AReS
## 2023-11-16
- Discuss mapping ICARDA outputs from Initiatives to ICARDA collections on CGSpace
- I added MEL's CGSpace user to the administrator group of a handful of collections
- I also did a batch mapping of 274 existing Initiative outputs from ICARDA to the relevant collections
## 2023-11-18
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-11-22
- I was checking out the [DSpace 7 statistics](https://github.com/DSpace/RestContract/blob/main/statistics-reports.md) again and found that we have total visits and total downloads for each DSpace object, for example [this item](https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748):
- TotalVisits: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits
- TotalDownloads: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads
- And the numbers match those in my dspace-statistics-api *exactly*!
- This can be useful to get an individual DSpace object's stats, but there is no way to iterate over all objects like all items...
- We can look at using this to draw stats on the community, collection, and item pages
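Since there is no way to iterate over all objects, one workaround is to build the usage report URLs for a known list of UUIDs. A minimal sketch, assuming the URL pattern from the two examples above holds for other objects; the helper name is hypothetical.

```python
BASE = "https://dspace7test.ilri.org/server/api/statistics/usagereports"


def usage_report_url(uuid: str, report: str) -> str:
    """Build a usage report URL, e.g. for TotalVisits or TotalDownloads."""
    return f"{BASE}/{uuid}_{report}"


# The item from the example above
item = "3f1b9605-f5ff-4bbb-8c89-d6fe4157f748"
for report in ("TotalVisits", "TotalDownloads"):
    print(usage_report_url(item, report))
```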
## 2023-11-23
- Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
```console
localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
count
───────
47818
(1 row)
```
- It's been some time since I looked at our Solr statistics to find new bots
- I found a few new ones that I [submitted to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/60) and added to our local bot list:
- GuzzleHttp/7
- Owler@ows.eu/1
- newspaperjs
- I ran my old `check-spider-hits.sh` script with a list of bots from our local overrides to purge hits from Solr:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 30 hits from ubermetrics in statistics
Purging 59 hits from curb in statistics
Purging 36 hits from bitdiscovery in statistics
Purging 87 hits from omgili in statistics
Purging 47 hits from Vizzit in statistics
Purging 109 hits from Java\/17-ea in statistics
Purging 40 hits from AdobeUxTechC4-Async in statistics
Purging 21 hits from ZaloPC-win32-24v473 in statistics
Purging 21 hits from nbertaupete95 in statistics
Purging 52 hits from Scoop\.it in statistics
Purging 16 hits from WebAPIClient in statistics
Purging 241 hits from RStudio in statistics
Purging 1255 hits from ^MEL in statistics
Purging 47850 hits from GuzzleHttp in statistics
Purging 8714 hits from Owler in statistics
Purging 1083 hits from newspaperjs in statistics
Purging 369 hits from ^Chrome$ in statistics
Purging 1474 hits from curl in statistics
Total number of bot hits purged: 61504
```
- I also noticed 35,000 requests over the past few years from lowercase user agents, which is [definitely weird](https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case), for example:
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
- I'm gonna add those to our overrides and purge them:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 35816 hits from ^mozilla in statistics
Total number of bot hits purged: 35816
```
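The `^mozilla` override works because real browsers send `Mozilla/5.0` with a capital M. A small sketch of the same check in Python, for testing candidate user agents before adding a pattern to the overrides; the function name is hypothetical.

```python
import re

# Case-sensitive on purpose: a lowercase "mozilla" prefix is the anomaly
LOWERCASE_MOZILLA = re.compile(r"^mozilla")


def is_suspicious_agent(user_agent: str) -> bool:
    """Return True for user agents that start with lowercase 'mozilla'."""
    return bool(LOWERCASE_MOZILLA.search(user_agent))


agents = [
    "mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
]
for agent in agents:
    print(is_suspicious_agent(agent), agent[:40])
```

Anchoring on the prefix is safer than flagging every all-lowercase agent, since tools like `curl` legitimately use lowercase names.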
## 2023-11-30
- Minor updates to our OAI MODS crosswalk
- Stefano found a minor markup issue with our alternative titles (`<titleInfo>` tag)
- Very high load on CGSpace since after lunch
- I killed some locks that had been stuck for a few hours
<!-- vim: set sw=2 ts=2: -->

271
content/posts/2023-12.md Normal file

@@ -0,0 +1,271 @@
---
title: "December, 2023"
date: 2023-12-01T08:48:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-12-01
- There is still high load on CGSpace and I don't know why
- I don't see a high number of sessions compared to previous days in the last few weeks
<!--more-->
```console
$ for file in dspace.log.2023-11-[23]*; do echo "$file"; grep -a -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
dspace.log.2023-11-20
22865
dspace.log.2023-11-21
20296
dspace.log.2023-11-22
19688
dspace.log.2023-11-23
17906
dspace.log.2023-11-24
18453
dspace.log.2023-11-25
17513
dspace.log.2023-11-26
19037
dspace.log.2023-11-27
21103
dspace.log.2023-11-28
23023
dspace.log.2023-11-29
23545
dspace.log.2023-11-30
21298
```
- Even the number of unique IPs is not very high compared to the last week or so:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq | wc -l
17023
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.2.gz | sort | uniq | wc -l
17294
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.3.gz | sort | uniq | wc -l
22057
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.4.gz | sort | uniq | wc -l
32956
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.5.gz | sort | uniq | wc -l
11415
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.6.gz | sort | uniq | wc -l
15444
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.7.gz | sort | uniq | wc -l
12648
```
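- One caveat with the commands above: `awk` reads the rotated `.gz` files as raw compressed bytes, so the counts for those files may not be real IP counts. A minimal Python sketch that handles both plain and gzip-compressed logs, assuming the client IP is the first whitespace-separated field (`open_log` and `unique_first_fields` are hypothetical names):

```python
import gzip
from pathlib import Path
from typing import Iterable, Set


def open_log(path: Path):
    """Open a plain or gzip-compressed log file as text."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def unique_first_fields(paths: Iterable[Path]) -> Set[str]:
    """Collect the first whitespace-separated field of every non-empty
    line, which is assumed here to be the client IP."""
    ips = set()
    for path in paths:
        with open_log(path) as handle:
            for line in handle:
                fields = line.split()
                if fields:
                    ips.add(fields[0])
    return ips
```

- The length of the returned set is the unique IP count, comparable to `sort | uniq | wc -l`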
- It doesn't make any sense so I think I'm going to restart the server...
- After restarting the server the load went down to normal levels... who knows...
- I started trying to see how I'm going to generate the fake statistics for the Alliance bitstream that was replaced
- I exported all the statistics for the owningItem now:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-export.json -f 'owningItem:b5862bfa-9799-4167-b1cf-76f0f4ea1e18' -k uid
```
- Importing them into DSpace Test didn't show the statistics in the Atmire module, but I see them in Solr...
## 2023-12-02
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-12-04
- Send a message to Altmetric support because the item IWMI highlighted last month still doesn't show the attention score for the Handle after I tweeted it several times weeks ago
- Spent some time writing a Python script to fix the literal MaxMind City JSON objects in our Solr statistics
- There are about 1.6 million of these, so I exported them using solr-import-export-json with the query `city:com*` but ended up finding many that have missing bundles, container bitstreams, etc:
```
city:com* AND -bundleName:[* TO *] AND -containerBitstream:[* TO *] AND -file_id:[* TO *] AND -owningItem:[* TO *] AND -version_id:[* TO *]
```
- (Note the negation to find fields that are missing)
- I don't know what I want to do with these yet
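- The negation pattern above is easy to get wrong by hand, so here is a small Python sketch that builds the same kind of query. The helper name `missing_fields_query` is made up for illustration:

```python
from typing import Iterable


def missing_fields_query(base: str, fields: Iterable[str]) -> str:
    """Build a Solr query matching `base` where none of `fields` has a value.

    `-field:[* TO *]` excludes documents that have any value in `field`.
    """
    clauses = [base] + [f"-{field}:[* TO *]" for field in fields]
    return " AND ".join(clauses)
```

- Calling it with `city:com*` and the five field names reproduces the query shown above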
## 2023-12-05
- I finished the `fix_maxmind_stats.py` script and fixed 1.6 million records and imported them on CGSpace after testing on DSpace 7 Test
- Altmetric said there was a glitch regarding the Handle and DOI linking and they successfully re-scraped the item page and linked them
- They sent me a list of current production IPs and I notice that some of them are in our nginx bot network list:
```console
$ for network in $(csvcut -c network /tmp/ips.csv | sed 1d | sort -u); do grepcidr $network ~/src/git/rmg-ansible-public/roles/dspace/files/nginx/bot-networks.conf; done
108.128.0.0/13 'bot';
46.137.0.0/16 'bot';
52.208.0.0/13 'bot';
52.48.0.0/13 'bot';
54.194.0.0/15 'bot';
54.216.0.0/14 'bot';
54.220.0.0/15 'bot';
54.228.0.0/15 'bot';
63.32.242.35/32 'bot';
63.32.0.0/14 'bot';
99.80.0.0/15 'bot'
```
- I will remove those for now so that Altmetric doesn't have any unexpected issues harvesting
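- A rough Python equivalent of the `grepcidr` check above, using only the standard library. It assumes the bot list has already been reduced to plain CIDR strings (the `'bot';` suffix from the nginx file would need to be stripped first), and `overlapping_networks` is a hypothetical helper:

```python
import ipaddress
from typing import Iterable, List


def overlapping_networks(candidates: Iterable[str], blocklist: Iterable[str]) -> List[str]:
    """Return the blocklist entries that overlap any candidate IP or network."""
    wanted = [ipaddress.ip_network(c, strict=False) for c in candidates]
    hits = []
    for entry in blocklist:
        network = ipaddress.ip_network(entry, strict=False)
        # Skip mixed IPv4/IPv6 comparisons explicitly.
        if any(network.version == w.version and network.overlaps(w) for w in wanted):
            hits.append(entry)
    return hits
```

- Anything this returns is a network that would need to be removed or narrowed so the harvester's IPs are not treated as bots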
## 2023-12-08
- Finalized the script to generate Solr statistics for Alliance researcher Mirjam
- The script is `ilri/generate_solr_statistics.py`
- I generated ~3,200 statistics based on her records of the download statistics of [that item](https://hdl.handle.net/10568/131997) and imported them on CGSpace
- Did some work on the DSpace 7 submission form
- Peter asked for lists of affiliations, investors, and publishers to do some cleanups
- I generated a list from a CSV export instead of doing it based on a SQL dump...
```console
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -hr \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-12-08-initiatives-affiliations.csv
```
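- The shell pipeline above is fairly dense, so here is a sketch of the same idea in Python: split the multi-value column on `||`, count the values, and list them most frequent first. The function name is hypothetical, and unlike the `sed` version it also trims surrounding whitespace:

```python
import csv
import io
from collections import Counter
from typing import List


def affiliations_by_frequency(csv_text: str, column: str) -> List[str]:
    """Split a multi-value column on "||" and return the distinct values,
    most frequent first."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for value in (row.get(column) or "").split("||"):
            value = value.strip()
            if value:
                counts[value] += 1
    # most_common() sorts by count, descending; ties keep first-seen order.
    return [value for value, _ in counts.most_common()]
```

- Using the `csv` module avoids the manual quote stripping that the `sed` expressions do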
- Export a list of authors as well:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 3 GROUP BY "dc.contributor.author" ORDER BY count DESC) to /tmp/2023-12-08-authors.csv WITH CSV HEADER;
COPY 102435
```
## 2023-12-11
- Work on OpenRXV dependencies and podman a bit
- Peter noticed that the statistics for this month are very very low on CGSpace
- I don't know what is going on, perhaps it is related to me adjusting the nginx config last week?
- Ah, it's probably because of the spider patterns I updated in 2023-11
## 2023-12-16
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-12-17
- Pull latest master branch for OpenRXV and deploy on the server
- I threw away some changes in the tree regarding the Angular base ref, and it broke AReS
- So note to self: we need to set the base ref in `frontend/Dockerfile` before building!
- Now Salem fixed the country map
## 2023-12-18
- Work a bit on the IFPRI-ISNAR archive from Leigh
- More work on the DSpace 7 home page
## 2023-12-19
- More work on the DSpace 7 home page
- The Alliance TIP team is testing deposits to the DSpace 7 REST API and getting an HTTP 500 error
- In the DSpace logs I see this after they log in, create the item, and update the metadata:
```
2023-12-19 17:49:28,022 ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
```
- I found some messages on the dspace-tech mailing list suggesting this might be an old bug: https://groups.google.com/g/dspace-tech/c/My1GUFYFGoU/m/tS7-WAJPAwAJ
- I restarted Tomcat and told the Alliance TIP team to try again
## 2023-12-20
- The Alliance guys said that submitting via REST works now... sigh, so that's just some old DSpace 5/6 REST API bug
- I lowercased all our AGROVOC keywords in `dcterms.subject` in SQL:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 462
dspace=*# COMMIT;
COMMIT
```
## 2023-12-25
- Looking into [Solr backups](https://solr.apache.org/guide/8_11/making-and-restoring-backups.html)
- Since we are not running in Solr Cloud mode we need to use the replication endpoint for Solr standalone
- This works:
```console
$ curl 'http://localhost:8983/solr/statistics/replication?command=backup'
{
"responseHeader":{
"status":0,
"QTime":26},
"status":"OK"}
```
- Then I saw the size of the snapshot reach the size of the index...
```console
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
16G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
20G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
21G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
22G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
```
- Then I deleted all documents in the core and restored from the snapshot backup:
```console
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<commit />'
$ curl 'http://localhost:8983/solr/statistics/replication?command=restore&name=statistics'
```
- Interestingly the restore worked fine, but created a new data index:
```console
# du -sh /var/solr/data/configsets/statistics/data/*
4.0K /var/solr/data/configsets/statistics/data/index.properties
22G /var/solr/data/configsets/statistics/data/restore.20231225154626463
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
22G /var/solr/data/configsets/statistics/data/snapshot.statistics
```
- Not sure of the implications of that, but Solr uses the data just fine
- I can surely use this for atomic Solr backups
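- If this gets scripted, the replication URLs used above could be built with a small helper. This is a sketch only: `replication_url` is a hypothetical name, and the base URL and core name are the ones from the commands above:

```python
from urllib.parse import urlencode


def replication_url(base: str, core: str, command: str, **params: str) -> str:
    """Build a URL for the replication handler of a standalone Solr core."""
    query = urlencode({"command": command, **params})
    return f"{base.rstrip('/')}/{core}/replication?{query}"
```

- For example, `replication_url("http://localhost:8983/solr", "statistics", "restore", name="statistics")` gives the restore URL used above. As far as I understand Solr's documentation, a backup taken with `name=statistics` is stored as `snapshot.statistics`, and there is also a `restorestatus` command for checking on a running restore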
## 2023-12-27
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Do some other metadata cleanups on CGSpace
- I also looked up our DOIs on Crossref to get some missing abstracts and correct licenses and dates
- Some minor work on the CGSpace DSpace 7 theme to fix the navbar on mobile
- Some work on the IFPRI ISNAR archive
## 2023-12-28
- I started porting the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to DSpace 7
- Some work on the IFPRI ISNAR archive
- I ended up going through most of the PDFs to get better dates and abstracts
## 2023-12-29
- I created a new Hetzner server to replace the current DSpace 6 CGSpace next week when we migrate to DSpace 7
- Interesting, I haven't checked for content pointing to legacy domains in several years (!)
- `inurl:mahider.cgiar.org`: 0 results on Google!
- `inurl:mahider.ilri.org`: 2,100 results on Google
- `inurl:mahider.ilri.org inurl:https`: 2 results on Google (!)
- `inurl:dspace.ilri.org`: 1,390 results on Google
- `inurl:dspace.ilri.org inurl:https`: 0 results on Google (!)
- So it seems I can do away with the HTTPS virtual hosts finally
- Well my current certificates expired on 2021-02-13 and nobody noticed... so...
<!-- vim: set sw=2 ts=2: -->

430
content/posts/2024-01.md Normal file

@@ -0,0 +1,430 @@
---
title: "January, 2024"
date: 2024-01-02T10:08:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-01-02
- Work on preparation of new server for DSpace 7 migration
- I'm not quite sure what we need to do for the Handle server
- For now I just ran the `dspace make-handle-config` script and diffed it with the one from DSpace 6
- I sent the bundle to the Handle admins to make sure it's OK before we do the migration
- Continue testing and debugging the cgspace-java-helpers on DSpace 7
- Work on IFPRI ISNAR archive cleanup
<!--more-->
## 2024-01-03
- I haven't heard from the Handle admins so I'm preparing a backup solution using nginx streams
- This seems to work in my simple tests (this must be outside the `http {}` block):
```
stream {
upstream handle_tcp_9000 {
server 188.34.177.10:9000;
}
server {
listen 9000;
proxy_connect_timeout 1s;
proxy_timeout 3s;
proxy_pass handle_tcp_9000;
}
}
```
- Here I forwarded a test TCP port 9000 from one server to another and was able to retrieve a test HTML page that was being served on the target
- I will have to do TCP and UDP on port 2641, and TCP/HTTP on port 8000.
- I did some more minor work on the IFPRI ISNAR archive
- I got some PDFs from the UMN AgEcon search and fixed some metadata
- Then I did some duplicate checking and found five items already on CGSpace
## 2024-01-04
- Upload 692 items for the ISNAR archive to CGSpace: https://cgspace.cgiar.org/handle/10568/136192
- Help Peter proof and upload 252 items from the 2023 Gender conference to CGSpace
- Meeting with IFPRI to discuss their migration to CGSpace
- We agreed to add two new fields, one for IFPRI project and one for IFPRI publication ranking
- Most likely we will use `cg.identifier.project` as a general field and consolidate other project fields there
- Not sure which field to use for the publication rank...
## 2024-01-05
- Proof and upload 51 items in bulk for IFPRI
- I did a big cleanup of user groups in anticipation of complaints about slow workflow tasks etc in DSpace 7
- I removed ILRI editors from all the dozens of CCAFS community and collection groups, and I should do the same for other CRPs since they have been closed for two years now
## 2024-01-06
- Migrate CGSpace to DSpace 7
## 2024-01-07
- High load on the server and UptimeRobot saying the frontend is flapping
- I noticed tons of logs from pm2 in the systemd journal, so I disabled those in the systemd unit because they are available from pm2's log directory anyway
- I also noticed the same for Solr, so I disabled stdout for that systemd unit as well
- I spent a lot of time bringing back the nginx rate limits we used in DSpace 6 and it seems to have helped
- I see some client doing weird HEAD requests to search pages:
```
47.76.35.19 - - [07/Jan/2024:00:00:02 +0100] "HEAD /search/?f.accessRights=Open+Access%2Cequals&f.actionArea=Resilient+Agrifood+Systems%2Cequals&f.author=Burkart%2C+Stefan%2Cequals&f.country=Kenya%2Cequals&f.impactArea=Climate+adaptation+and+mitigation%2Cequals&f.itemtype=Brief%2Cequals&f.publisher=CGIAR+System+Organization%2Cequals&f.region=Asia%2Cequals&f.sdg=SDG+12+-+Responsible+consumption+and+production%2Cequals&f.sponsorship=CGIAR+Trust+Fund%2Cequals&f.subject=environmental+factors%2Cequals&spc.page=1 HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.2504.63 Safari/537.36"
```
- I will add their network blocks (AS45102) and regenerate my list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS21859
$ cat AS* | sort | uniq | wc -l
4897
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
2017 /tmp/networks.txt
```
- I'm surprised to see the number of networks reduced from my current ones... hmmm.
- I will also update my list of Bing networks:
```console
$ ./ilri/bing-networks-to-ips.sh
$ ~/go/bin/mapcidr -a < /tmp/bing-ips.txt > /tmp/bing-networks.txt
$ wc -l /tmp/bing-networks.txt
250 /tmp/bing-networks.txt
```
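- For comparison, Python's standard library can do a similar aggregation to `mapcidr -a`. This is a sketch and I have not checked that it produces exactly the same output as mapcidr; `aggregate_networks` is a hypothetical name:

```python
import ipaddress
from typing import Iterable, List


def aggregate_networks(cidrs: Iterable[str]) -> List[str]:
    """Collapse CIDR strings into the smallest equivalent list of networks."""
    by_version = {4: [], 6: []}
    for cidr in cidrs:
        cidr = cidr.strip()
        if cidr:
            network = ipaddress.ip_network(cidr, strict=False)
            by_version[network.version].append(network)
    collapsed = []
    for version in (4, 6):
        # collapse_addresses() merges adjacent and nested networks, but
        # raises TypeError if IPv4 and IPv6 are mixed in one call.
        collapsed.extend(ipaddress.collapse_addresses(by_version[version]))
    return [str(network) for network in collapsed]
```

- Merging adjacent and nested networks like this is also why an aggregated list can come out much shorter than the raw list of prefixes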
## 2024-01-08
- Export list of publishers for Peter to select some amount to use as a controlled vocabulary:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.publisher", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 178 GROUP BY "dcterms.publisher" ORDER BY count DESC) to /tmp/2024-01-publishers.csv WITH CSV HEADER;
COPY 4332
```
- Address some feedback on DSpace 7 from users, including filing some issues on GitHub
- https://github.com/DSpace/dspace-angular/issues/2730: List of available metadata fields is truncated when adding new metadata in "Edit Item"
- The Alliance TIP team was having issues posting to one collection via the legacy DSpace 6 REST API
- In the DSpace logs I see the same issue that they had last month:
```
ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
```
## 2024-01-09
- I restarted Tomcat to see if it helps the REST issue
- After talking with Peter about publishers we decided to get a clean list of the top ~100 publishers and then make sure all CGIAR centers, Initiatives, and Impact Platforms are there as well
- I exported a list from PostgreSQL and then filtered by count > 40 in OpenRefine and then extracted the metadata values:
```
$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
```
- Export a list of ORCID identifiers from PostgreSQL to look them up on ORCID and update our controlled vocabulary:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
localhost/dspace7= ☘ \q
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-09-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
```
- Then I updated existing ORCID identifiers in CGSpace:
```
$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
```
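- The `grep -oE ... | sort -u` extraction step above could also be written in Python. A minimal sketch using the same pattern (`extract_orcids` is a hypothetical helper, separate from the `ilri/` scripts):

```python
import re
from typing import Iterable, List

# Same pattern as the grep above: four groups of four characters.
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(texts: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated ORCID iDs found in the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)
```

- The pattern only checks the shape of the identifier, so it does not validate the ORCID checksum digit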
- Bizu seems to be having issues due to belonging to too many groups
- I see some messages from Solr in the DSpace log:
```
2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
```
- There are 1,805 OR clauses in the full log!
- We previously had this issue in 2020-01 and 2020-02 with DSpace 5 and DSpace 6
- At the time the solution was to increase the `maxBooleanClauses` in Solr and to disable access rights awareness, but I don't think we want to do the second one now
- I saw many users of Solr in other applications increasing this to obscenely high numbers, so I think we should be OK to increase it from 1024 to 2048
- Re-visiting the DSpace user groomer to delete inactive users
- In 2023-08 I noticed that this was now [possible in DSpace 7](https://github.com/DSpace/DSpace/pull/2928)
- As a test I tried to delete all users who have been inactive since six years ago (January 9, 2018):
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
```
- I tested it on DSpace 7 Test and it worked... I am debating running it on CGSpace...
- I see we have almost 9,000 users:
```console
$ dspace user -L > /tmp/users-before.txt
$ wc -l /tmp/users-before.txt
8943 /tmp/users-before.txt
```
- I decided to do the same on CGSpace and it worked without errors
- I finished working on the controlled vocabulary for publishers
## 2024-01-10
- I spent some time deleting old groups on CGSpace
- I looked into the use of the `cg.identifier.ciatproject` field and found there are only a handful of uses, with some even seeming to be a mistake:
```console
localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
cg.identifier.ciatproject │ count
───────────────────────────┼───────
D145 │ 4
LAM_LivestockPlus │ 2
A215 │ 1
A217 │ 1
A220 │ 1
A223 │ 1
A224 │ 1
A227 │ 1
A229 │ 1
A230 │ 1
CLIMATE CHANGE MITIGATION │ 1
LIVESTOCK │ 1
(12 rows)
Time: 240.041 ms
```
- I think we can move those to a new `cg.identifier.project` if we create one
- The `cg.identifier.cpwfproject` field is similarly sparse, but the CCAFS ones are widely used
## 2024-01-12
- Export a list of affiliations to do some cleanup:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
```
- I first did some clustering and editing in OpenRefine, then I'll import those back into CGSpace and then do another export
- Troubleshooting the statistics pages that aren't working on DSpace 7
- On a hunch, I queried for Solr statistics documents that **did not have an `id` matching the 36-character UUID pattern**:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"-id:/.{36}/",
"rows":"0"}},
"response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
}}
```
- They seem to come mostly from 2020, 2023, and 2024:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
"responseHeader":{
"status":0,
"QTime":13,
"params":{
"facet.range":"time",
"q":"-id:/.{36}/",
"facet.range.gap":"+1YEAR",
"rows":"0",
"facet":"true",
"facet.range.start":"2010-01-01T00:00:00Z",
"facet.range.end":"NOW"}},
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"time":{
"counts":[
"2010-01-01T00:00:00Z",0,
"2011-01-01T00:00:00Z",0,
"2012-01-01T00:00:00Z",0,
"2013-01-01T00:00:00Z",0,
"2014-01-01T00:00:00Z",0,
"2015-01-01T00:00:00Z",89,
"2016-01-01T00:00:00Z",11,
"2017-01-01T00:00:00Z",0,
"2018-01-01T00:00:00Z",0,
"2019-01-01T00:00:00Z",0,
"2020-01-01T00:00:00Z",1339,
"2021-01-01T00:00:00Z",0,
"2022-01-01T00:00:00Z",0,
"2023-01-01T00:00:00Z",653736,
"2024-01-01T00:00:00Z",144993],
"gap":"+1YEAR",
"start":"2010-01-01T00:00:00Z",
"end":"2025-01-01T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
```
- They seem to come from 2023-08 until now (so way before we migrated to DSpace 7):
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
"responseHeader":{
"status":0,
"QTime":196,
"params":{
"facet.range":"time",
"q":"-id:/.{36}/",
"facet.range.gap":"+1MONTH",
"rows":"0",
"facet":"true",
"facet.range.start":"2023-01-01T00:00:00Z",
"facet.range.end":"NOW"}},
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"time":{
"counts":[
"2023-01-01T00:00:00Z",1,
"2023-02-01T00:00:00Z",0,
"2023-03-01T00:00:00Z",0,
"2023-04-01T00:00:00Z",0,
"2023-05-01T00:00:00Z",0,
"2023-06-01T00:00:00Z",0,
"2023-07-01T00:00:00Z",0,
"2023-08-01T00:00:00Z",27621,
"2023-09-01T00:00:00Z",59165,
"2023-10-01T00:00:00Z",115338,
"2023-11-01T00:00:00Z",96147,
"2023-12-01T00:00:00Z",355464,
"2024-01-01T00:00:00Z",125429],
"gap":"+1MONTH",
"start":"2023-01-01T00:00:00Z",
"end":"2024-02-01T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
```
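- Solr returns range facet counts as a flat list of alternating labels and counts, which is awkward to read. A small Python sketch for pairing them up and dropping the empty buckets (both function names are hypothetical):

```python
from typing import Dict, List, Union


def facet_counts_to_dict(counts: List[Union[str, int]]) -> Dict[str, int]:
    """Pair up Solr's flat [label, count, label, count, ...] facet list."""
    if len(counts) % 2:
        raise ValueError("expected an even number of elements")
    labels = counts[0::2]
    values = counts[1::2]
    return {str(label): int(value) for label, value in zip(labels, values)}


def nonzero_buckets(counts: List[Union[str, int]]) -> Dict[str, int]:
    """Keep only the buckets that contain at least one document."""
    return {label: n for label, n in facet_counts_to_dict(counts).items() if n > 0}
```

- The input would be the `facet_counts` → `facet_ranges` → `time` → `counts` list from the parsed JSON response above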
- I see that we had 31,744 statistic events yesterday, and 799 have no `id`!
- I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
- Several people said they have them, so it's a bug of some sort in DSpace, not our configuration
## 2024-01-13
- Yesterday alone we had 37,000 unique IPs making requests to nginx
- I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14
## 2024-01-15
- Investigating the CSS selector warning that I've seen in PM2 logs:
```console
0|dspace-ui | 1 rules skipped due to selector errors:
0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
```
- It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is not invalid
- But that led me to a more interesting issue with `inlineCritical` optimization for styles in Angular SSR that might be responsible for causing high load in the frontend
- See: https://github.com/angular/angular/issues/42098
- See: https://github.com/angular/universal/issues/2106
- See: https://github.com/GoogleChromeLabs/critters/issues/78
- Since the production site was flapping a lot I decided to try disabling inlineCriticalCss
- There have been on and off load issues with the Angular frontend today
- I think I will just block all data center network blocks for now
- In the last week I see almost 200,000 unique IPs:
```console
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u | tee /tmp/ips.txt | wc -l
196493
```
- Looking these IPs up I see there are 18,000 coming from Comcast, 10,000 from AT&T, 4,110 from Charter, 3,500 from Cox and dozens of other residential ISPs
- I highly doubt these are home users browsing CGSpace... seems super fishy
- Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
- I will temporarily add a few new datacenter ISP network blocks to our rate limit:
- 16509 Amazon-02
- 701 UUNET
- 8075 Microsoft
- 15169 Google
- 14618 Amazon-AES
- 396982 Google Cloud
- The load on the server *immediately* dropped
## 2024-01-17
- It turns out AS701 (UUNET) is Verizon Business, which is used as an ISP for many staff at IFPRI
- This was causing them to see HTTP 429 "too many requests" errors on CGSpace
- I removed this ASN from the rate limiting
## 2024-01-18
- Start looking at Solr stats again
- I found one statistics record that has 22,000 of the same collection in `owningColl` and 22,000 of the same community in `owningComm`
- The record is from 2015 and I think it would be easier to delete it than fix it:
```console
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>uid:3b4eefba-a302-4172-a286-dcb25d70129e</query></delete>'
```
- Looking again, there are at least 1,000 of these so I will need to come up with an actual solution to fix these
- I'm noticing we have 1,800+ links to defunct resources on bioversityinternational.org in the `cg.link.permalink` field
- I should ask Alliance if they have any plans to fix those, or upload them to CGSpace
## 2024-01-22
- Meeting with IWMI about ORCID integration on CGSpace now that we've migrated to DSpace 7
- File an issue for the inaccurate DSpace statistics: https://github.com/DSpace/DSpace/issues/9275
## 2024-01-23
- Meeting with IWMI about ORCID integration and the DSpace API for use with WordPress
- IFPRI sent me a list of their author ORCIDs to add to our controlled vocabulary
- I joined them with our current list and resolved their names on ORCID and updated them in our database:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI\ ORCiD\ All.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-23-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
```
- This adds about 400 new identifiers to the controlled vocabulary
- I consolidated our various project identifier fields for closed programs into one `cg.identifier.project`:
- `cg.identifier.ccafsproject`
- `cg.identifier.ccafsprojectpii`
- `cg.identifier.ciatproject`
- `cg.identifier.cpwfproject`
- I prefixed the existing 2,644 metadata values with "CCAFS", "CIAT", or "CPWF" so we can figure out where they came from if need be, and deleted the old fields from the metadata registry
## 2024-01-26
- Minor work on dspace-angular to clean up component styles
- Add `cg.identifier.publicationRank` to CGSpace metadata registry and submission form
## 2024-01-29
- Rework the nginx bot and network limits slightly to remove some old patterns/networks and remove Google
- The Google Scholar team contacted me to ask why their requests were timing out (well...)
<!-- vim: set sw=2 ts=2: -->

118
content/posts/2024-02.md Normal file

@@ -0,0 +1,118 @@
---
title: "February, 2024"
date: 2024-02-05T11:10:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-02-05
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Lower case all the AGROVOC subjects on CGSpace
<!--more-->
```sql
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 180
dspace=*# COMMIT;
COMMIT
```
## 2024-02-06
- Discuss IWMI using the CGSpace REST API for their new website
- Export the IWMI community to extract their ORCID identifiers:
```console
$ dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
$ csvcut -c 'cg.creator.identifier,cg.creator.identifier[en_US]' ~/Downloads/2024-02-06-iwmi.csv \
| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
| sort -u \
| tee /tmp/iwmi-orcids.txt \
| wc -l
353
$ ./ilri/resolve_orcids.py -i /tmp/iwmi-orcids.txt -o /tmp/iwmi-orcids-names.csv -d
```
- I noticed some similar looking names in our list so I clustered them in OpenRefine and manually checked a dozen or so to update our list
## 2024-02-07
- Maria asked me about the "missing" item from last week again
- I can see it when I use the Admin search, but not in her workflow
- It was submitted by TIP so I checked that user's workspace and found it there
- After depositing, it went into the workflow so Maria should be able to see it now
## 2024-02-09
- Minor edits to CGSpace submission form
- Upload 55 ISNAR book chapters to CGSpace from Peter
## 2024-02-19
- Looking into the collection mapping issue on CGSpace
- It seems to be by design in DSpace 7: https://github.com/DSpace/dspace-angular/issues/1203
- This is a massive setback for us...
## 2024-02-20
- Minor work on OpenRXV to fix a bug in the ng-select drop downs
- Minor work on the DSpace 7 nginx configuration to allow requesting robots.txt and sitemaps without hitting rate limits
## 2024-02-21
- Minor updates on OpenRXV, including one bug fix for missing mapped collections
- Salem had to re-work the harvester for DSpace 7 since the mapped collections and parent collection list are separate!
## 2024-02-22
- Discuss tagging of datasets and re-work the submission form to encourage use of DOI field for any item that has a DOI, and the normal URL field if not
- The "cg.identifier.dataurl" field will be used for "related" datasets
- I still have to check and move some metadata for existing datasets
## 2024-02-23
- This morning Tomcat died due to an OOM kill from the kernel:
```console
kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 pgtables:20436kB oom_score_adj:0
```
- I don't see any abnormal pattern in my Grafana graphs, for JVM or system load... very weird
- I updated the submission form on CGSpace to include the new changes to URLs for datasets
- I also updated about 80 datasets to move the URLs to the correct field
## 2024-02-25
- This morning Tomcat died while I was doing a CSV export, with an OOM kill from the kernel:
```console
kernel: Out of memory: Killed process 720768 (java) total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0
```
- I don't know why this is happening so often recently...
## 2024-02-27
- IFPRI sent me a list of authors to add to our list for now, until we can find a better way of doing it
- I extracted the existing authors from our controlled vocabulary and combined them with IFPRI's:
```console
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/dc-contributor-author.xml \
| grep -oE 'label=".*"' \
| sed -e 's/label="//' -e 's/"$//' > /tmp/authors
$ cat /tmp/authors /tmp/ifpri-authors | sort -u > /tmp/new-authors
```
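- For reference, the same extract-and-merge step could be sketched in Python (a minimal sketch, assuming the vocabulary's `<node>` elements carry the author name in a `label` attribute, as the `grep` above implies; the function names are hypothetical, and Python sorts by code point rather than by locale as `sort -u` may):

```python
import xml.etree.ElementTree as ET


def extract_labels(vocabulary_xml):
    """Return the non-empty label attribute of every <node> element."""
    root = ET.fromstring(vocabulary_xml)
    # iter() walks the whole tree, including the root element itself
    return [node.get("label") for node in root.iter("node") if node.get("label")]


def merge_authors(existing, new):
    """Combine two author lists, dropping duplicates and sorting the result."""
    return sorted(set(existing) | set(new))
```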
## 2024-02-28
- I figured out a way to add a new Angular component to handle all our relation fields
## 2024-02-29
- Clean up a bunch of metadata on CGSpace
<!-- vim: set sw=2 ts=2: -->

207
content/posts/2024-03.md Normal file
View File

@@ -0,0 +1,207 @@
---
title: "March, 2024"
date: 2024-03-01T09:55:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-03-01
- Last week Bizu reported an issue with the "browse by issue date" drop down
- I verified it, and suspect it could be due to missing issue dates...
- It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
<!--more-->
- I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable
- I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846
## 2024-03-03
- I did some cleanups on abstracts, licenses, and dates from CrossRef
- I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list
## 2024-03-05
- I tried a new technique to get some affiliations from Crossref using OpenRefine
- First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
- Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
- Then I joined them with our affiliations, paying no attention to duplicates
- Then I deduped them using the Jython technique I learned in 2023-02
## 2024-03-06
- Peter sent me some more corrections for the authors that I had sent him in 2023-12
## 2024-03-08
- IFPRI sent me their 2023 records from CONTENTdm so I started working on those
- I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:
```python
import re
with open(r"/tmp/cg-creator-identifier.txt", 'r') as f:
    orcid_ids = [orcid_id.strip() for orcid_id in f]

matched = False
for orcid_id in orcid_ids:
    if re.search(r'.+: {}'.format(value), orcid_id):
        matched = True
        break

if matched:
    return orcid_id
else:
    return value
```
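- To test this lookup outside OpenRefine, the same logic can be wrapped in a plain function (a sketch only, assuming the identifier file contains lines like `Name: 0000-0000-0000-0000`, which is what the regular expression above implies; `match_orcid` is a hypothetical name, and the `re.escape()` call is an addition that is not in the expression above):

```python
import re


def match_orcid(value, orcid_lines):
    """Return the first "Name: ORCID" line containing the given value, else the value itself."""
    # re.escape() stops characters in the cell value from being
    # interpreted as regular expression syntax
    pattern = re.compile(r".+: {}".format(re.escape(value)))
    for line in orcid_lines:
        line = line.strip()
        if pattern.search(line):
            return line
    return value
```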
- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:
```sql
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');
```
- Note the use of two single quotes to escape the one in the name
## 2024-03-11
- Experimenting with moving some of my Python scripts to the DSpace 7 REST API
- I need a way to get UUIDs for Handles...
- Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864
- Then just take the first result...?
- I spent some time working on the script to get abstracts from CGSpace, and found a bug in my logic
- I also noticed that one item had two abstracts, but the first one was blank!
- Looking deeper, I found 113 blank metadata values so I deleted those:
```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
COMMIT;
```
- I also found a few dozen items with "N/A" for their citation, so I deleted those too:
```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;
COMMIT;
```
- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:
```console
$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \
| csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv
```
## 2024-03-12
- Run the duplicate checker for IFPRI 2023 batch upload
## 2024-03-13
- I found about 428 duplicates in the IFPRI 2023 batch records
- Alarmingly, I found about 18 that are duplicated on CGSpace as well!
- I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
- Alliance asked me to get him the Handles for items submitted by TIP that are not discoverable
- I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:
```sql
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
```
## 2024-03-14
- Looking into reports of rate limiting of Altmetric's bot on CGSpace
- I don't see any HTTP 429 responses for their user agents in any of our logs...
- I tried myself on an item page and never hit a limit...
```console
$ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done
Request 1: 200
Request 2: 200
Request 3: 200
Request 4: 200
...
Request 60: 200
```
- All responses were HTTP 200...
- In any case, I whitelisted their production IPs and told them to try again
- I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace
- I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace
- This was a bit of dirty work using csvkit, xsv, and OpenRefine
## 2024-03-17
- There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace
- These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration
- I looked closer and whittled this down to 14 actual records, and spent some time working on them
- I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links
- Now there only remain two confusing records about the Inkomati catchment
## 2024-03-18
- Checking to see how many IFPRI records we have migrated so far:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \
| tee /tmp/ifpri-records.csv \
| csvstat --count
898
```
- I finalized the remaining two on Inkomati catchment and now we are at 900!
## 2024-03-19
- IWMI sent me some new author ORCID identifiers so I updated our list
- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC
- First extracting all unique subjects on CGSpace:
```
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;
COPY 28024
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv
```
## 2024-03-20
- Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace
## 2024-03-21
- Look more closely at duplicates on CGSpace based on a fresh export
- Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates
- I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original
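- Finding DOIs that occur more than once in an export can be sketched like this (a minimal sketch; `duplicated_dois` is a hypothetical helper, and lowercasing the values before counting is an assumption, not necessarily how the count above was produced):

```python
from collections import Counter


def duplicated_dois(dois):
    """Return a dict of DOIs that occur more than once, with their counts."""
    # Strip and lowercase so trivially different spellings count as the same DOI
    counts = Counter(doi.strip().lower() for doi in dois if doi and doi.strip())
    return {doi: n for doi, n in counts.items() if n > 1}
```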
## 2024-03-22
- Look at duplicate DOIs on CGSpace and address a dozen or so
## 2024-03-23
- Look at duplicate DOIs on CGSpace and address a dozen or so
- Update Tomcat and Solr to latest versions
- I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked
## 2024-03-24
- Slowly process several dozen more duplicate DOIs on CGSpace, sigh...
## 2024-03-26
- File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880
- Merge metadata and withdraw more duplicates on CGSpace
<!-- vim: set sw=2 ts=2: -->

169
content/posts/2024-04.md Normal file

@@ -0,0 +1,169 @@
---
title: "April, 2024"
date: 2024-04-04T10:23:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-04-04
- Work on CGSpace duplicate DOIs more
<!--more-->
## 2024-04-08
- Start working on IFPRI's 2022 batch import
- I ran the duplicate checker against CGSpace and started downloading all linked PDFs
## 2024-04-09
- Continue working on IFPRI's 2022 batch import
- I started validating the potential duplicates in OpenRefine
## 2024-04-12
- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then uploaded them
- I need to merge the metadata for the remaining 212 that are already on CGSpace
- Spend some time looking at duplicate DOIs again...
## 2024-04-13
- Spend some time looking at duplicate DOIs again...
## 2024-04-14
- Spend some time looking at duplicate DOIs again...
## 2024-04-15
- Spend some time looking at duplicate DOIs again...
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
- Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:
```
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
```
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
- I merged metadata with the existing items
- There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself
## 2024-04-16
- Spend some time looking at duplicate DOIs again...
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:
```
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
```
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
- I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
- It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!
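- Outside OpenRefine, a Python 3 version of the same request might look like this (a sketch under assumptions: the function names are hypothetical and the `mailto` default is a placeholder address; `fetch_citation` performs a real network request):

```python
import urllib.request


def build_citation_request(doi, mailto="user@example.org"):
    """Build a Crossref request for a formatted citation of the given DOI."""
    url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
    # Identify the client with a contact address, as in the Jython version
    headers = {"User-Agent": "Python (mailto:{})".format(mailto)}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi):
    """Fetch the citation text; this performs a network request."""
    with urllib.request.urlopen(build_citation_request(doi)) as response:
        return response.read().decode("utf-8")
```

- In Python 3 the response bytes are decoded explicitly, so the Unicode handling that needed care in Python 2 is less of a concern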
## 2024-04-18
- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:
```sql
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
```
- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:
```sql
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
```
- Then I can work that list into an nginx map with redirect, for example:
```console
server {
...
if ($new_uri) {
return 301 $new_uri;
}
}
map $request_uri $new_uri {
/handle/10568/112821 /handle/10568/97605;
}
```
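- Generating the `map` entries from the query results could be sketched as follows (assuming the results are exported as CSV with `handle_from` and `handle_to` columns, as in the query above, and that both columns hold bare handles like `10568/112821`; `nginx_map_entries` is a hypothetical name):

```python
import csv
import io


def nginx_map_entries(csv_text):
    """Turn handle_from,handle_to CSV rows into nginx map entries."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumes bare handles such as 10568/112821 in both columns
        entries.append("/handle/{} /handle/{};".format(row["handle_from"], row["handle_to"]))
    return entries
```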
## 2024-04-19
- Spend some time looking at duplicate DOIs again...
- Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary
## 2024-04-20
- I read an [interesting thread about DOI casing](https://github.com/greenelab/scihub/issues/9)
- Apparently the DOI specification says ASCII characters in DOIs are case insensitive
- Indeed, [Crossref recommends lower case](https://www.crossref.org/documentation/member-setup/constructing-your-dois/) for all DOIs
- I was curious about the DOIs in our database so I checked before and after lower casing:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
COPY 25675
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
COPY 25666
```
- I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata...
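- The difference between the two counts above (25,675 − 25,666 = 9) is the number of DOIs that only differ from another one by case; the same check can be sketched in Python (`count_case_collisions` is a hypothetical helper):

```python
def count_case_collisions(dois):
    """Return (distinct before, distinct after) lowercasing a list of DOIs."""
    before = set(dois)
    after = {doi.lower() for doi in dois}
    return len(before), len(after)
```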
## 2024-04-23
- Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
- The workflow curation tasks are not documented very well but I got a basic configuration working
- I found a bug in DSpace curation tasks and discussed on Slack
- I finalized the `NormalizeDOIs` curation task and released v7.6.1.1 of the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) project
## 2024-04-24
- A bit more testing of the curation tasks
- I tested a patch by Mark Wood
- I added support for normalizing DOIs to this same format to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) project
## 2024-04-25
- I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters
- Spend some time looking at duplicate DOIs again...
## 2024-04-26
- Spend some time looking at duplicate DOIs again...
## 2024-04-29
- Start working on the IFPRI 2020–2021 batch migration
- I modified my `check_duplicates.py` script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact
- I noticed something in the Tomcat log:
```console
tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Women’s Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [’] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
```
- I found the bitstream's ID and then used the `ds6_bitstream2itemhandle` [SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the item's handle
- Then I replaced the curly quote with a regular quote in all bitstreams
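- Code point 8,217 is U+2019, the right single quotation mark; checking and fixing a file name can be sketched like this (a minimal sketch with hypothetical function names, not the method used for the actual cleanup):

```python
def is_latin1_safe(name):
    """Return True if every character is in the 0 to 255 range from the Tomcat error."""
    return all(ord(char) <= 255 for char in name)


def replace_curly_quotes(name):
    """Replace curly single quotes (U+2018, U+2019) with an ASCII apostrophe."""
    return name.translate(str.maketrans({"\u2018": "'", "\u2019": "'"}))
```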
<!-- vim: set sw=2 ts=2: -->

197
content/posts/2024-05.md Normal file

@@ -0,0 +1,197 @@
---
title: "May, 2024"
date: 2024-05-01T10:39:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-05-01
- I dumped all the CGSpace DOIs and resolved them with my `crossref_doi_lookup.py` script
- Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
<!--more-->
## 2024-05-05
- Spend some time looking at duplicate DOIs again...
## 2024-05-06
- Spend some time looking at duplicate DOIs again...
## 2024-05-07
- Discuss RSS feeds and OpenSearch with IWMI
- It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch
- I saw a patch for an interesting issue on DSpace GitHub: [Error submitting or deleting items - URI too long when user is in a large number of groups](https://github.com/DSpace/DSpace/issues/9544)
- I hadn't realized it, but we have lots of those errors:
```console
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
1423
```
- Spend some time looking at duplicate DOIs again...
## 2024-05-08
- Spend some time looking at duplicate DOIs again...
- I finally finished looking at the duplicate DOIs for journal articles
- I updated the list of handle redirects and there are 386 of them!
## 2024-05-09
- Spend some time working on the IFPRI 2020–2021 batch
- I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
## 2024-05-12
- I couldn't figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields, like titles, handles, and provenance, separately:
```psql
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
```
- Then joined them:
```console
$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
```
- This gives me an insight into who submitted 334 of the duplicates over the past few years...
- I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ff, fi, fl, ffi, and ffl
## 2024-05-13
- Export a list of IFPRI information products with handle links and CONTENTdm links:
```
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
```
- I discovered the `/server/api/pid/find` endpoint today, which is much more direct and manageable than the `/server/api/discover/search/objects?query=` endpoint when trying to get metadata for a Handle (item, collection, or community)
- The "pid" stands for permanent identifiers apparently, and we can use it like this:
```
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
```
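- Building the lookup URL and reading the UUID from the result could be sketched like this (assumptions: the returned object is JSON with a top-level `uuid` key, and both function names are hypothetical):

```python
from urllib.parse import urlencode


def pid_find_url(base_url, handle):
    """Build the pid/find URL for a Handle such as 10568/118424."""
    # safe="/" keeps the slash in the Handle unescaped, as in the example above
    return base_url.rstrip("/") + "/server/api/pid/find?" + urlencode({"id": handle}, safe="/")


def uuid_from_object(dspace_object):
    """Return the UUID from a parsed DSpace object (item, collection, or community)."""
    # Assumes the REST representation exposes a top-level "uuid" key
    return dspace_object["uuid"]
```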
## 2024-05-15
- I got journal titles for 2,900 journal articles that were missing them from Crossref
## 2024-05-16
- Helping IFPRI with some DSpace 7 API support, these are two queries for items issued in 2024:
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D (note that the Lucene search syntax is the URL-encoded version of `:[2024-01-01T00:00:00Z TO *]`)
- Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
- I wrote a new version of the `check_duplicates.py` script to help identify duplicates with different types
- Initially I called it `check_duplicates_fast.py` but it's actually not faster
- I need to find a way to deal with duplicates from IFPRI's repository because there are some mismatched types...
## 2024-05-20
- Continue working through alternative duplicate matching for IFPRI
- Their item types are sometimes different than ours...
- One thing I think I can say for sure is that the default similarity factor in my script is 0.6, and I rarely see legitimate duplicates with such similarity so I might increase this to 0.7 to reduce the number of items I have to check
- Also, the difference in issue dates is currently 365, but I should reduce that a bit, perhaps to 270 days (9 months)
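- The two thresholds can be illustrated with a small sketch (this uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, since these notes don't say how `check_duplicates.py` computes similarity; the function name and defaults are hypothetical and simply use the proposed values of 0.7 and 270 days):

```python
from datetime import date
from difflib import SequenceMatcher


def is_potential_duplicate(title_a, date_a, title_b, date_b, min_similarity=0.7, max_days=270):
    """Flag two records whose titles are similar and whose issue dates are close."""
    similarity = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    days_apart = abs((date_a - date_b).days)
    return similarity >= min_similarity and days_apart <= max_days
```

- Raising `min_similarity` or lowering `max_days` both shrink the list of candidates to review, at the cost of possibly missing duplicates with reworded titles or badly dated records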
## 2024-05-22
- Finalize and upload the IFPRI 2020–2021 batch set
- I used a new technique to get missing licenses via Crossref (it's Python 2 because of OpenRefine's Jython):
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
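- The expression above returns the raw JSON; pulling the license URLs out of it can be sketched like this (a minimal sketch, assuming the usual Crossref response shape with a `message` object containing an optional `license` list of entries with a `URL` key; `extract_license_urls` is a hypothetical name):

```python
import json


def extract_license_urls(response_text):
    """Return the license URLs from a Crossref works API response, if any."""
    message = json.loads(response_text).get("message", {})
    # Not every work has license metadata, so default to an empty list
    return [entry["URL"] for entry in message.get("license", []) if "URL" in entry]
```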
## 2024-05-23
- Finalize the last of the duplicates I found for the IFPRI 2020–2021 batch set (those that we missed initially due to mismatched types)
- Export a new list of IFPRI redirects from CONTENTdm:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -r 'Original URLs? from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
4004
```
- I found a way to get abstracts from PLOS
- They offer an API that returns XML including the JATS-formatted abstracts
- I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:
```console
"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'
```
- Then used `value.parseXml()` on the resulting text to extract the abstract's text:
```console
value.parseXml().select("abstract")[0].xmlText()
```
- This doesn't preserve `<p>` tags though...
- Oh, nice, this does!
```console
forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")
```
- For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines...
- Ah, some articles have multiple abstracts, for example: https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript
- I need to select the abstract that does **not** have any attributes (using [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html))
```console
forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")
```
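- The same selection can be sketched in Python with the standard library (assumptions: the XML parses without external entities, and only the first attribute-less `<abstract>` is wanted, whereas the GREL above joins paragraphs from every matching abstract; `extract_main_abstract` is a hypothetical name):

```python
import xml.etree.ElementTree as ET


def extract_main_abstract(xml_text):
    """Join the paragraphs of the first <abstract> element that has no attributes."""
    root = ET.fromstring(xml_text)
    for abstract in root.iter("abstract"):
        if abstract.attrib:
            # Skip abstracts that carry attributes, for example a summary type
            continue
        # itertext() flattens inline markup such as <italic> into plain text
        paragraphs = ["".join(p.itertext()).strip() for p in abstract.iter("p")]
        return "\r\n\r\n".join(paragraphs)
    return ""
```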
- Testing `xsv` (Rust) versus `csvkit` (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:
```console
$ time xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv | xsv select doi | xsv count
27339
xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv 0.06s user 0.03s system 98% cpu 0.091 total
xsv select doi 0.02s user 0.02s system 40% cpu 0.091 total
xsv count 0.01s user 0.00s system 9% cpu 0.090 total
$ time csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
27339
csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv 1.15s user 0.06s system 95% cpu 1.273 total
csvcut -c doi 0.42s user 0.05s system 36% cpu 1.283 total
csvstat --count 0.20s user 0.03s system 18% cpu 1.298 total
```
## 2024-05-27
- Working on IFPRI datasets batch migration
- 732 items total
- 6 duplicates on CGSpace
- 6 duplicates within set that need investigation
## 2024-05-28
- I'm thinking of increasing the frequency of thumbnail generation on CGSpace
- Currently the `dspace filter-media` script runs once at 3AM for all media types and seems to take ~10 minutes to run for all 118,000 items...
- I think I will make the thumbnailer run explicitly more often using `-p "ImageMagick PDF Thumbnail"`
<!-- vim: set sw=2 ts=2: -->

119
content/posts/2024-06.md Normal file

@@ -0,0 +1,119 @@
---
title: "June, 2024"
date: 2024-06-03T14:14:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-06-03
- Working on IFPRI datasets
- I noticed the licenses were missing from Nilam's original file so I found a way to check [Dataverse's API for a persistent identifier](https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats)
- We have both Handles and DOIs for these datasets, both from Harvard's Dataverse
<!--more-->
- I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):
```
"https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:" + value.split('https://doi.org/')[-1].toUppercase()
```
- Then I was able to extract the license text from the JSON response using:
```
value.parseJson()['datasetVersion']['termsOfUse']
```
- Similar for the Handle...
## 2024-06-04
- Some Dataverse entries have the license in `['datasetVersion']['license']` instead...
- I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace
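- Checking both locations for the license can be sketched in Python (a minimal sketch; `extract_dataverse_license` is a hypothetical name, and the value under `license` may be a string or a nested object depending on the Dataverse version, which is an assumption these notes don't confirm):

```python
import json


def extract_dataverse_license(response_text):
    """Return license information from a Dataverse JSON export, or None."""
    version = json.loads(response_text).get("datasetVersion", {})
    # Check the termsOfUse key first, then fall back to the license key
    return version.get("termsOfUse") or version.get("license")
```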
## 2024-06-14
- Minor cleanups on IFPRI's 2016–2019 batch migration file
- I will start with duplicates on unique identifiers like DOIs
## 2024-06-18
- Merge and upload metadata for duplicates in IFPRI's 2016–2019 set:
- 144 exact match on CGSpace via DOI, type, and date
- 32 with CGSpace handles
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
## 2024-06-19
- Spent some time checking the remaining 3,312 records in the IFPRI 2016–2019 migration set for duplicates on CGSpace
- There seem to be about 50 exact matches of title, type, and issue date
## 2024-06-20
- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
- Heavy load on both CGSpace and DSpace 7 Test this afternoon
- Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
- The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
```
0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
```
- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
- I changed the nginx logging to use the `X-Forwarded-For` header, because the default `combined` log format uses `$remote_addr`, which is only accurate if the request doesn't come from Angular (i.e. it goes directly to the API)
- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
- For now I will just add those to the list of bot networks
## 2024-06-21
- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
- I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
- Then I updated the list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS132203 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS136907 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS14618 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS21859 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS396982 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS8075
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
8675 /tmp/networks.txt
```
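- The aggregation step can be approximated with Python's standard `ipaddress` module (a sketch only; `aggregate_networks` is a hypothetical name and the output is not guaranteed to match `mapcidr -a` line for line):

```python
import ipaddress


def aggregate_networks(lines):
    """Collapse a list of CIDR strings into the smallest covering set, per IP version."""
    networks = [ipaddress.ip_network(line.strip(), strict=False) for line in lines if line.strip()]
    collapsed = []
    for version in (4, 6):
        # collapse_addresses() raises TypeError on mixed IPv4 and IPv6 input
        same_version = [n for n in networks if n.version == version]
        collapsed.extend(str(n) for n in ipaddress.collapse_addresses(same_version))
    return collapsed
```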
- Update list of ORCID identifiers with new ones from Alliance and IFPRI
- Finalize uploading the remaining 3,264 items from IFPRI's 2016–2019 batch migration to CGSpace
## 2024-06-24
- Minor updates to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) and [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to normalize a few more invalid DOI formats
## 2024-06-25
- Work on uploading some missing PDFs from the IFPRI 2016–2019 batch migration
## 2024-06-26
- Did a big cleanup of several thousand journal articles based on metadata from Crossref
<!-- vim: set sw=2 ts=2: -->

57
content/posts/2024-07.md Normal file

@@ -0,0 +1,57 @@
---
title: "July, 2024"
date: 2024-07-01T09:37:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-07-01
- A bit of work to clean up duplicate DOIs on CGSpace
- A handful of book chapters, working papers, and journal articles using the wrong DOI
- I tried to delete all users who have been inactive for the past six years (since July 1, 2018):
<!--more-->
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 07/01/2018 -d
```
- File an issue on DSpace GitHub: [Allow configuring disallowed domains for self registration](https://github.com/DSpace/DSpace/issues/9675)
## 2024-07-11
- Minor fixes to normalize the IFPRI CONTENTdm URLs in provenance fields:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'cdm/ref', 'digital') WHERE text_value LIKE '%CONTENTdm%cdm/ref/%';
UPDATE 1876
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'CONTENTdm: ', 'CONTENTdm: ') WHERE text_value LIKE '%CONTENTdm: %';
UPDATE 21
dspace=*# COMMIT;
COMMIT
```
- Then export a new list of CONTENTdm redirects, excluding withdrawn items:
```console
dspace= ☘ \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
COPY 8568
```
- Similarly, get a list of withdrawn item redirects:
```console
dspace= ☘ \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
COPY 396
```
## 2024-07-18
- I experimented with adding a regular expression to validate DOIs to the submission form
- It is a slightly modified version of the one found here: https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
- I decided it will probably be confusing to people and will have limited benefit, since we are normalizing most forms of DOIs to our preferred form after submission anyway
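- For reference, a minimal Python sketch of this kind of validation. The pattern is an assumption based on the commonly cited Crossref-style expression anchored to the `https://doi.org/` form, not necessarily the modified expression I tested in the submission form, and `is_doi_url` is a hypothetical helper name:

```python
import re

# Assumed pattern: the widely cited Crossref-style DOI expression, prefixed
# with https://doi.org/. It is NOT necessarily the expression from the form.
DOI_URL_PATTERN = re.compile(
    r"^https://doi\.org/10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE
)


def is_doi_url(value: str) -> bool:
    """Return True if value looks like a DOI in https://doi.org/10.xxxx/yyy form."""
    return DOI_URL_PATTERN.match(value.strip()) is not None
```

- A strict check like this rejects bare DOIs and `dx.doi.org` links, which is the sort of thing that would confuse submitters and that we normalize after submission anyway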
<!-- vim: set sw=2 ts=2: -->

71
content/posts/2024-08.md Normal file

@@ -0,0 +1,71 @@
---
title: "August, 2024"
date: 2024-08-08T23:07:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-08-08
- While working on the CGIAR Climate Change Synthesis I learned some new tricks with OpenRefine
<!--more-->
- The first was to retrieve affiliations from OpenAlex and extract them from JSON with this GREL:
```
forEach(
  value.parseJson()['authorships'],
  a,
  forEach(
    a.parseJson()['institutions'],
    i,
    i['display_name']
  ).join("||")
).join("||")
```
- It is a nested `forEach` to extract all institutions for all authors
- Second was a better way to deduplicate lists in Jython while preserving list order:
```python
# better dedupe preserves order
seen = set()
deduped_list = [x for x in value.split("||") if x not in seen and not seen.add(x)]
return "||".join(deduped_list)
```
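- Outside of OpenRefine, both tricks can be combined in plain Python 3. This is only a sketch: `institutions` is a hypothetical helper name, and the input is assumed to be an OpenAlex work object with the same `authorships` → `institutions` → `display_name` structure used in the GREL above:

```python
import json


def institutions(work_json: str) -> str:
    """Return all institution names for all authors, deduplicated in order."""
    work = json.loads(work_json)
    names = [
        institution["display_name"]
        for authorship in work["authorships"]
        for institution in authorship["institutions"]
    ]
    # dict.fromkeys keeps the first occurrence of each name, in order
    return "||".join(dict.fromkeys(names))
```

- `dict.fromkeys` relies on dictionaries preserving insertion order, which is only guaranteed from Python 3.7, so the `seen` set approach is still the one to use in OpenRefine's Jython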
## 2024-08-20
- Delete duplicate metadata values using the method I described in this GitHub issue: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
## 2024-08-22
- Help IWMI with some OpenSearch RSS/Atom feeds for search results:
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:flooding
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:drought
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:landslides
- Export list of withdrawn handle redirects:
```
dspace=# \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
COPY 400
```
- Export list of IFPRI CONTENTdm redirects:
```
dspace-# \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
COPY 10794
```
- I filed [an issue](https://github.com/DSpace/dspace-angular/issues/3258) on DSpace Angular for anonymous users to be able to export search results to CSV
## 2024-08-26
- Spent some time trying to rebase our DSpace Angular themes on top of the massive header/navbar rework from [DSpace 7.6.2](https://github.com/DSpace/dspace-angular/pull/2858)
- Spent some time getting missing bibliographic metadata (issue dates, licenses, pages, volume, issue, publisher, etc) from Crossref for CGSpace
<!-- vim: set sw=2 ts=2: -->

147
content/posts/2024-09.md Normal file

@@ -0,0 +1,147 @@
---
title: "September, 2024"
date: 2024-09-01T21:16:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-09-01
- Upgrade CGSpace to DSpace 7.6.2
<!--more-->
## 2024-09-05
- Finalize work on migrating DSpace Angular from Yarn to NPM
## 2024-09-06
- This morning Tomcat crashed due to an OOM kill:
```
Sep 06 00:00:24 server systemd[1]: tomcat9.service: A process of this unit has been killed by the OOM killer.
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Main process exited, code=killed, status=9/KILL
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Failed with result 'oom-kill'.
```
- According to the system journal, it was a Node.js dspace-angular process that tried to allocate memory and failed, thus invoking the OOM killer
- Currently I see high memory usage in those processes:
```console
$ pm2 status
┌────┬──────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │ cpu │ mem │ user │ watching │
├────┼──────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 0 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 994 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 1 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1015 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 2 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1029 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 3 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1042 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
└────┴──────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
```
- I bet if I look in the logs I'd find some kind of heavy traffic on the frontend, causing high caching for Angular SSR
## 2024-09-08
- Analyzing memory use in our DSpace hosts, which have 32GB of memory
- Effective cache of PostgreSQL is estimated at 11GB, which seems way high since the database is only 2GB
- Realistically this should be how we adjust, with PostgreSQL using ~8GB (or less) and each dspace-angular process pinned at 2GB...
> Total - Solr - Tomcat - Postgres - Nginx - Angular
> 31366 - (1024×4.4) - 7168 - (8×1024) - 512 - (4×2048) = 2796.4 left...
- I put some of these changes in on DSpace Test and will monitor this week
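- The memory budget above as a quick Python sketch, with all values in MB as in the note. The dictionary keys are just labels I chose for illustration:

```python
TOTAL_MB = 31366

# Planned allocations in MB, in the same order as the note above
budget_mb = {
    "solr": 1024 * 4.4,
    "tomcat": 7168,
    "postgres": 8 * 1024,
    "nginx": 512,
    "angular": 4 * 2048,  # four dspace-angular processes pinned at 2GB each
}

remaining_mb = TOTAL_MB - sum(budget_mb.values())
print(f"{remaining_mb:.1f} MB left")
```

- This makes it easy to see how the headroom changes if, for example, PostgreSQL is given less than 8GB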
## 2024-09-10
- Some bot in South Africa made a ton of requests on the API and made the load hit the roof:
```
# grep -E '10/Sep/2024:[10-11]' /var/log/nginx/api-access.log | awk '{print $1}' | sort | uniq -c | sort -h
...
149720 102.182.38.90
```
- They are using several user agents so are obviously a bot:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0
Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
```
- I added them to the list of bot networks in nginx and the load went down
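- A caveat about the `grep` pattern above: inside brackets `[10-11]` is a character class that matches a single `0` or `1`, so it selects every hour from 00 to 19 rather than just hours 10 and 11. A small Python sketch of the difference, using made-up sample lines and assuming nginx's combined log format (`count_ips` is a hypothetical helper):

```python
import re
from collections import Counter

# "[10-11]" is the character "1", the range "0-1", and "1" again, so it
# matches ONE character that is either "0" or "1"
loose = re.compile(r"10/Sep/2024:[10-11]")
# Hours 10 and 11 only
strict = re.compile(r"10/Sep/2024:1[01]:")


def count_ips(lines, pattern):
    """Count requests per client IP (first field) on lines matching pattern."""
    return Counter(line.split()[0] for line in lines if pattern.search(line))


# Made-up sample lines, not real log data
sample = [
    '192.0.2.1 - - [10/Sep/2024:09:15:00 +0200] "GET /server/api HTTP/1.1" 200 10 "-" "ua"',
    '192.0.2.1 - - [10/Sep/2024:10:15:00 +0200] "GET /server/api HTTP/1.1" 200 10 "-" "ua"',
    '192.0.2.2 - - [10/Sep/2024:11:59:59 +0200] "GET /server/api HTTP/1.1" 200 10 "-" "ua"',
    '192.0.2.2 - - [10/Sep/2024:17:00:00 +0200] "GET /server/api HTTP/1.1" 200 10 "-" "ua"',
]

loose_counts = count_ips(sample, loose)
strict_counts = count_ips(sample, strict)
```

- With the sample above the loose pattern counts all four lines, while the strict one counts only the two requests made during hours 10 and 11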
## 2024-09-11
- Upgrade DSpace 7 Test to Ubuntu 24.04
- I did some minor maintenance to test dspace-statistics-api with Python 3.12
- I tagged version 1.4.4 and released it on GitHub
## 2024-09-14
- Noticed a persistent higher than usual load on CGSpace and checked the server logs
- Found some new data center subnets to block because they were making thousands of requests with normal user agents
- I enabled HTTP/3 in nginx
- I enabled the SSR patch in Angular: https://github.com/DSpace/dspace-angular/issues/3110
## 2024-09-16
- Experiment with the <a href="https://github.com/codeobia/dspace-statistics-api-js">dspace-statistics-api-js</a> on DSpace 7 Test
- In the past it always caused Solr to run out of memory, but I increased Solr's heap from 2g to 3g and it runs without crashing
- I attached VisualVM to Solr with a 3g and 4g heap and iterated over 1260 pages of results in the dspace-statistics-api-js:
![Solr with 3g heap](/cgspace-notes/2024/09/2024-09-16-Solr-3g-heap.png)
![Solr with 4g heap](/cgspace-notes/2024/09/2024-09-16-Solr-4g-heap.png)
## 2024-09-23
- Upgrade PostgreSQL from version 14 to 15 on DSpace Test the same way I did last year:
```console
# apt update
# apt install postgresql-15
# Update configs with Ansible
# systemctl stop tomcat9
# pg_ctlcluster 14 main stop
# tar -cvzpf var-lib-postgresql-14.tar.gz /var/lib/postgresql/14
# tar -cvzpf etc-postgresql-14.tar.gz /etc/postgresql/14
# pg_ctlcluster 15 main stop
# pg_dropcluster 15 main
# pg_upgradecluster 14 main
# pg_ctlcluster 15 main start
...
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_is_well_formed(text) does not exist
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_valid(text) does not exist
```
- After that I [reindexed all tables](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/) using a generated query:
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- The database shrank by 186MB!
## 2024-09-29
- I upgraded the database on CGSpace to PostgreSQL 15
<!-- vim: set sw=2 ts=2: -->

82
content/posts/2024-10.md Normal file

@@ -0,0 +1,82 @@
---
title: "October, 2024"
date: 2024-10-03T11:01:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-10-03
- I had an idea to get abstracts from OpenAlex
- For [copyright reasons they don't include plain abstracts](https://docs.openalex.org/api-entities/works/work-object#abstract_inverted_index), but the [pyalex](https://github.com/J535D165/pyalex) library can convert them on the fly
<!--more-->
- I filtered for journal articles that were Creative Commons and missing abstracts:
```console
$ csvcut -c 'id,dc.title[en_US],dcterms.abstract[en_US],cg.identifier.doi[en_US],dcterms.type[en_US],dcterms.language[en_US],dcterms.license[en_US]' ~/Downloads/2024-09-30-cgspace.csv | csvgrep -c 'dcterms.type[en_US]' -r '^Journal Article$' | csvgrep -c 'cg.identifier.doi[en_US]' -r '^.+$' | csvgrep -c 'dcterms.license[en_US]' -r '^CC-' | csvgrep -c 'dcterms.abstract[en_US]' -r '^$' | csvgrep -c 'dcterms.language[en_US]' -r '^en$' | grep -v "||" | grep -v -- '-ND' | grep -v -E 'https://doi.org/10.(2499|4160|17528)/' > /tmp/missing-abstracts.csv
```
- Then wrote a script to get them from OpenAlex
- After inspecting and cleaning a few dozen up in OpenRefine (removing "Keywords:" and copyright, and HTML entities, etc) I managed to get about 440
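- For context, OpenAlex's `abstract_inverted_index` maps each word to the list of positions where it occurs, so rebuilding the text means inverting that mapping. A minimal sketch of the idea with a made-up sample; this is my own illustration and not pyalex's code, and `abstract_from_inverted_index` is a hypothetical name:

```python
def abstract_from_inverted_index(inverted_index: dict) -> str:
    """Rebuild plain text from a {word: [positions]} inverted index."""
    words_by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word
    # Emit the words in position order
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


# Made-up sample in the same shape as OpenAlex's field
sample_index = {
    "Maize": [0],
    "yields": [1, 5],
    "rose": [2],
    "while": [3],
    "bean": [4],
    "fell": [6],
}
abstract = abstract_from_inverted_index(sample_index)
```

- The sample rebuilds to "Maize yields rose while bean yields fell"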
## 2024-10-06
- Since I increased Solr's heap from 2 to 3G a few weeks ago it seems like Solr is always using 100% CPU
- I don't understand this because it was running well before, and I only increased it in anticipation of running the dspace-statistics-api-js, though I never got around to it
- I just realized that this may be related to the JMX monitoring, as I've seen gaps in the Grafana dashboards and remember that it took surprisingly long to scrape the metrics
- Maybe I need to change the scrape interval
## 2024-10-08
- I checked the VictoriaMetrics vmagent dashboard and saw that there were thousands of errors scraping the `jvm_solr` target from Solr
- So it seems like I do need to change the scrape interval
- I will increase it from 15s (global) to 20s for that job
- Reading some documentation I found [this reference from Brian Brazil that discusses this very problem](https://www.robustperception.io/keep-it-simple-scrape_interval-id/)
- He recommends keeping a single scrape interval for all targets, but also checking the slow exporter (`jmx_exporter` in this case) and seeing if we can limit the data we scrape
- To keep things simple for now I will increase the global scrape interval to 20s
- Long term I should limit the metrics...
- Oh wow, I found out that [Solr ships with a Prometheus exporter!](https://solr.apache.org/guide/8_11/monitoring-solr-with-prometheus-and-grafana.html) and even includes a Grafana dashboard
- I'm trying to run the Solr prometheus-exporter as a one-off systemd unit to test it:
```console
# cd /opt/solr-8.11.3/contrib/prometheus-exporter
# systemd-run --uid=victoriametrics --gid=victoriametrics --working-directory=/opt/solr-8.11.3/contrib/prometheus-exporter ./bin/solr-exporter -p 9854 -b http://localhost:8983/solr -f ./conf/solr-exporter-config.xml -s 20
```
- The exporter's default scrape interval is 60 seconds, so if we scrape it more often than that the metrics will be stale
- From what I've seen this returns in less than one second so it should be safe to reduce the scrape interval
## 2024-10-19
- Heavy load on CGSpace today
- There is a noticeable increase just before 4PM local time
- I extracted a list of IPs:
```console
# grep -E '19/Oct/2024:1[567]' /var/log/nginx/api-access.log | awk '{print $1}' | sort -u > /tmp/ips.txt
```
- I looked them up and found hundreds of data center IPs that were using normal user agents, for example:
- 154.47.29.168 # 212238 (CDNEXT - Datacamp Limited, GB)
- 91.210.64.12 # 29802 (HVC-AS, US) - HIVELOCITY, Inc.
- 103.221.57.120 # 132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
- 109.107.150.136 # 201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
- 185.210.207.1 # 209709 (CODE200-ISP1 - UAB code200, LT)
- 185.162.119.101 # 207223 (GLOBALCON - Global Connections Network LLC, US)
- 173.244.35.101 # 64286 (LOGICWEB, US) - Tesonet
- 139.28.160.141 # 396319 (US-INTERNET-396319, US) - OxyLabs
- 104.143.89.112 # 62874 (WEB2OBJECTS, US) - Web2Objects LLC
- I added some network blocks to the nginx conf
- Interestingly, I see so many IPs using the same user agent today:
```console
# grep "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.3" /var/log/nginx/api-access.log | awk '{print $1}' | sort -u | wc -l
767
```
- For reference, the current Chrome version is 129 or so...
- This is definitely worth looking into because it seems like one massive botnet
<!-- vim: set sw=2 ts=2: -->

50
content/posts/2024-11.md Normal file

@@ -0,0 +1,50 @@
---
title: "November, 2024"
date: 2024-11-11T09:47:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-11-11
- Some IP in India is making tons of requests this morning with a normal user agent:
```console
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
```
<!--more-->
- They are using this user agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
```
## 2024-11-16
- I switched CGSpace to Node.js v20 since I've been using it in dev and test for months
## 2024-11-18
- I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc
- Google also publishes their IP ranges: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- Our nginx config doesn't rate limit the API but perhaps that needs to change...
- In DSpace 4/5/6 the API was separate from the user interface so we didn't need to enforce rate limits there because we encouraged using that over scraping the UI
- In DSpace 7 the API is used by the frontend and perhaps should have the same IP- and UA-based rate limiting
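- One way to catch fake Googlebots would be to compare the client IP against the published ranges. A minimal sketch, assuming the ranges have already been extracted into a list of CIDR strings; the prefixes below are documentation placeholders and not Google's real ranges, and `ip_in_prefixes` is a hypothetical helper:

```python
import ipaddress


def ip_in_prefixes(ip: str, prefixes: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR prefixes."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(prefix) for prefix in prefixes)


# Placeholder prefixes (reserved documentation ranges), NOT Google's ranges
sample_prefixes = ["192.0.2.0/24", "2001:db8::/32"]

claims_googlebot_but_is_not = not ip_in_prefixes("188.34.177.10", sample_prefixes)
```

- A client that sends a Googlebot user agent from an address outside the real ranges could then be rate limited like any other bot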
## 2024-11-19
- I notice 10,000 requests by a new bot yesterday:
```
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
```
- Seems to be some kind of PHP framework library
- Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
- 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents
<!-- vim: set sw=2 ts=2: -->

28
content/posts/2024-12.md Normal file

@@ -0,0 +1,28 @@
---
title: "December, 2024"
date: 2024-12-04T10:19:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-12-04
- We need to get view and download statistics for the last year from CGSpace
- The only way to get that is using Solr
<!--more-->
- After consulting the [Solr documentation](https://solr.apache.org/guide/8_11/working-with-dates.html) I came up with this facet query:
> facet.range=time&facet.range.start=NOW/MONTH-11MONTHS&facet.range.end=NOW/MONTH+1MONTH&facet.range.gap=+1MONTH
- [This StackOverflow answer](https://stackoverflow.com/questions/34290600/how-to-apply-facet-on-date-field-where-result-should-provide-number-of-records-f) helped too, recommending `NOW/MONTH` to get neatly bucketed months because this will use the beginning of the current month
- For views, I added the following query parameters: `q=type:2&fq=-isBot:true AND statistics_type:view`
> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview&indent=true&q.op=OR&q=type%3A2&rows=0
- For downloads I added the following query parameters: `q=type:0&fq=-isBot:true AND statistics_type:view AND bundleName:ORIGINAL`
> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview%20AND%20bundleName%3AORIGINAL&indent=true&q.op=OR&q=type%3A0&rows=0
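- The two URLs above can be rebuilt with Python's standard library, which is handy for scripting the export. A minimal sketch; `build_stats_url` is a hypothetical helper name and the parameter order simply mirrors the URLs above:

```python
from urllib.parse import quote, urlencode

SOLR_SELECT = "http://localhost:8983/solr/statistics/select"


def build_stats_url(q: str, fq: str) -> str:
    """Build a monthly range-facet URL covering the last twelve calendar months."""
    params = [
        ("facet.range.end", "NOW/MONTH+1MONTH"),
        ("facet.range.gap", "+1MONTH"),
        ("facet.range.start", "NOW/MONTH-11MONTHS"),
        ("facet.range", "time"),
        ("facet", "true"),
        ("fq", fq),
        ("indent", "true"),
        ("q.op", "OR"),
        ("q", q),
        ("rows", "0"),
    ]
    # quote (instead of the default quote_plus) encodes spaces as %20
    return f"{SOLR_SELECT}?{urlencode(params, quote_via=quote)}"


views_url = build_stats_url("type:2", "-isBot:true AND statistics_type:view")
downloads_url = build_stats_url(
    "type:0", "-isBot:true AND statistics_type:view AND bundleName:ORIGINAL"
)
```

- Encoding matters here because a literal `+` in a query string is decoded as a space, so `+1MONTH` has to be sent as `%2B1MONTH`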
<!-- vim: set sw=2 ts=2: -->

38
content/posts/2025-01.md Normal file

@@ -0,0 +1,38 @@
---
title: "January, 2025"
date: 2025-01-03T11:09:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2025-01-03
- Trying to get search results for a large boolean query given to me by some researchers
- When searching via the Angular frontend I see an error in the Tomcat logs:
<!--more-->
```
Jan 03 09:08:40 dspace tomcat9[876]: Jan 03, 2025 9:08:40 AM org.apache.coyote.http11.Http11Processor service
Jan 03 09:08:40 dspace tomcat9[876]: INFO: Error parsing HTTP request header
Jan 03 09:08:40 dspace tomcat9[876]: Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
Jan 03 09:08:40 dspace tomcat9[876]: java.lang.IllegalArgumentException: Request header is too large
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:778)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeader(Http11InputBuffer.java:892)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeaders(Http11InputBuffer.java:593)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:279)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:937)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1190)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at java.base/java.lang.Thread.run(Thread.java:840)
```
- The size of the query itself is 5362 bytes
- Increasing the `maxHttpHeaderSize` from the default of 8192 bytes to 16384 allows the search to complete successfully
- I notice that we had previously increased the `maxHttpHeaderSize` on the HTTP connector in Tomcat 7, which we are no longer using in Tomcat 9, so this is an overdue change
<!-- vim: set sw=2 ts=2: -->


@@ -0,0 +1,66 @@
+++
title = "Harmonization of CG Core Output Types"
date = 2021-02-21T13:27:35+02:00
description = "Proposed changes to CG Core types after review of several CGIAR repositories."
categories = ["Notes"]
tags = ["Migration"]
url = "cgcore-types-harmonization"
draft = true
+++
Proposed changes to the CG Core controlled vocabulary for output types after review of actual usage by several CGIAR open access repositories.
With reference to [CG Core v2 draft standard](https://agriculturalsemantics.github.io/cg-core/cgcore.html) by Marie-Angélique as well as [DCMI DCTERMS](http://www.dublincore.org/specifications/dublin-core/dcmi-terms/).
<!--more-->
- [Proposed Changes](#proposed-changes)
- [Out of Scope](#out-of-scope)
- [Implementation Progress](#implementation-progress)
## Proposed Changes
As of 2021-01-18 the scope of the changes includes the following fields:
- cg.creator.id→cg.creator.identifier
- ORCID identifiers
- dc.format.extent→dcterms.extent
- dc.date.issued→dcterms.issued
- dc.description.abstract→dcterms.abstract
- dc.description→dcterms.description
- dc.description.sponsorship→cg.contributor.donor
- values from CrossRef or Grid.ac if possible
- dc.description.version→cg.reviewStatus
- cg.fulltextstatus→cg.howPublished
- CGSpace uses values like "Formally Published" or "Grey Literature"
- dc.identifier.citation→dcterms.bibliographicCitation
- cg.identifier.status→dcterms.accessRights
- current values are "Open Access" and "Limited Access"
- future values are possibly "Open" and "Restricted"?
- dc.language.iso→dcterms.language
- current values are ISO 639-1 (aka Alpha 2)
- future values are possibly ISO 639-3 (aka Alpha 3)?
- cg.link.reference→dcterms.relation
- dc.publisher→dcterms.publisher
- dc.relation.ispartofseries will be split into:
- series name: dcterms.isPartOf
- series number: cg.number
- dc.rights→dcterms.license
- Using [SPDX license identifiers](https://spdx.org/licenses/) if possible
- dc.source→cg.journal
- dc.subject→dcterms.subject
- dc.type→dcterms.type
- dc.identifier.isbn→cg.isbn
- dc.identifier.issn→cg.issn
- cg.targetaudience→dcterms.audience
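The one-to-one renames above lend themselves to a simple lookup table. The following is only a sketch of how such a mapping could be applied to a metadata record held as a Python dictionary; it covers a subset of the fields listed above, and `FIELD_RENAMES` and `rename_fields` are hypothetical names rather than part of any migration tool:

```python
# A subset of the one-to-one renames listed above
FIELD_RENAMES = {
    "dc.format.extent": "dcterms.extent",
    "dc.date.issued": "dcterms.issued",
    "dc.description.abstract": "dcterms.abstract",
    "dc.identifier.citation": "dcterms.bibliographicCitation",
    "dc.language.iso": "dcterms.language",
    "dc.rights": "dcterms.license",
    "dc.type": "dcterms.type",
}


def rename_fields(record: dict) -> dict:
    """Return a copy of record with migrated field names; others are kept."""
    return {FIELD_RENAMES.get(field, field): value for field, value in record.items()}
```

Fields that need more than a rename, such as splitting `dc.relation.ispartofseries` into a series name and number, would need their own handling.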
### Out of Scope
The following fields are currently out of the scope of this migration because they are used internally by DSpace 5.x/6.x and would be difficult to change without significant modifications to the core of the code:
- dc.title (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)
- dc.title.alternative
- dc.date.available
- dc.date.accessioned
- dc.identifier.uri (hard coded for Handle assignment upon item submission)
- dc.description.provenance
- dc.contributor.author (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)


@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -242,15 +242,15 @@ db.statementpool = true
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -264,15 +264,15 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -58,7 +58,7 @@ Update GitHub wiki for documentation of maintenance tasks.
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -200,15 +200,15 @@ $ find SimpleArchiveForBio/ -iname &ldquo;*.pdf&rdquo; -exec basename {} ; | sor
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -68,7 +68,7 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -378,15 +378,15 @@ Bitstream: tést señora alimentación.pdf
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -58,7 +58,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -316,15 +316,15 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -32,7 +32,7 @@ After running DSpace for over five years I&rsquo;ve never needed to look in any
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -62,7 +62,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -495,15 +495,15 @@ dspace.log.2016-04-27:7271
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -371,15 +371,15 @@ sys 0m20.540s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ Working on second phase of metadata migration, looks like this will work for mov
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -409,15 +409,15 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -74,7 +74,7 @@ In this case the select query was showing 95 results before the update
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -325,15 +325,15 @@ discovery.index.authority.ignore-variants=true
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -72,7 +72,7 @@ $ git rebase -i dspace-5.5
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -389,15 +389,15 @@ $ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -34,7 +34,7 @@ It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -478,8 +478,8 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
</code></pre><ul>
<li>It actually works really well, and search results return much less hits now (before, after):</li>
</ul>
<p><img src="/cgspace-notes/2016/09/cgspace-search.png" alt="CGSpace search with &amp;ldquo;OR&amp;rdquo; boolean logic">
<img src="/cgspace-notes/2016/09/dspacetest-search.png" alt="DSpace Test search with &amp;ldquo;AND&amp;rdquo; boolean logic"></p>
<p><img src="/cgspace-notes/2016/09/cgspace-search.png" alt="CGSpace search with &ldquo;OR&rdquo; boolean logic">
<img src="/cgspace-notes/2016/09/dspacetest-search.png" alt="DSpace Test search with &ldquo;AND&rdquo; boolean logic"></p>
<ul>
<li>Found a way to improve the configuration of Atmire&rsquo;s Content and Usage Analysis (CUA) module for date fields</li>
</ul>
@@ -606,15 +606,15 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -42,7 +42,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -72,7 +72,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -372,15 +372,15 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;h
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module (#286)
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -56,7 +56,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -548,15 +548,15 @@ org.dspace.discovery.SearchServiceException: Error executing query
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it&rsquo;s not r
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -76,7 +76,7 @@ Another worrying error from dspace.log is:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -668,7 +668,7 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
<li>This is how DSpace works, and I need to ask if there is a way to override someone&rsquo;s submission, as the other reviewer seems to not be paying attention, or has perhaps taken the item from the task pool?</li>
<li>Run a batch edit to add &ldquo;RANGELANDS&rdquo; ILRI subject to all items containing the word &ldquo;RANGELANDS&rdquo; in their metadata for Peter Ballantyne</li>
</ul>
<p><img src="/cgspace-notes/2016/12/batch-edit1.png" alt="Select all items with &amp;ldquo;rangelands&amp;rdquo; in metadata">
<p><img src="/cgspace-notes/2016/12/batch-edit1.png" alt="Select all items with &ldquo;rangelands&rdquo; in metadata">
<img src="/cgspace-notes/2016/12/batch-edit2.png" alt="Add RANGELANDS ILRI subject"></p>
<h2 id="2016-12-18">2016-12-18</h2>
<ul>
@@ -784,15 +784,15 @@ $ exit
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -58,7 +58,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -369,15 +369,15 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -80,7 +80,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -423,15 +423,15 @@ COPY 1968
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -84,7 +84,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -355,15 +355,15 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -70,7 +70,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thu
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -585,15 +585,15 @@ $ gem install compass -v 1.0.3
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -48,7 +48,7 @@
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -391,15 +391,15 @@ UPDATE 187
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -48,7 +48,7 @@
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -270,15 +270,15 @@ $ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -275,15 +275,15 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -90,7 +90,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -517,15 +517,15 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -62,7 +62,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -659,15 +659,15 @@ Cert Status: good
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -443,15 +443,15 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -78,7 +78,7 @@ COPY 54701
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -944,15 +944,15 @@ $ cat dspace.log.2017-11-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sor
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -30,7 +30,7 @@ The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -60,7 +60,7 @@ The list of connections to XMLUI and REST API for today:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -783,15 +783,15 @@ DELETE 20
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -180,7 +180,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1452,15 +1452,15 @@ Catalina:type=Manager,context=/,host=localhost activeSessions 8
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -30,7 +30,7 @@ We don&rsquo;t need to distinguish between internal and external works, so that
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -60,7 +60,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-pl
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1038,15 +1038,15 @@ UPDATE 3
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -54,7 +54,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -585,15 +585,15 @@ Fixed 5 occurences of: GENEBANKS
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -56,7 +56,7 @@ Catalina logs at least show some memory errors yesterday:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -594,15 +594,15 @@ $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -68,7 +68,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -523,15 +523,15 @@ $ psql -h localhost -U postgres dspacetest
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -58,7 +58,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -88,7 +88,7 @@ sys 2m7.289s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -517,15 +517,15 @@ $ sed &#39;/^id/d&#39; 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ There is insufficient memory for the Java Runtime Environment to continue.
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -569,15 +569,15 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -76,7 +76,7 @@ I ran all system updates on DSpace Test and rebooted it
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -442,15 +442,15 @@ $ dspace database migrate ignored
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -30,7 +30,7 @@ I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -60,7 +60,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -748,15 +748,15 @@ UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND me
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -56,7 +56,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -656,15 +656,15 @@ $ curl -X GET -H &#34;Content-Type: application/json&#34; -H &#34;Accept: applic
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ Today these are the top 10 IPs:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -553,15 +553,15 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -594,15 +594,15 @@ UPDATE 1
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -50,7 +50,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
357 207.46.13.1
903 54.70.40.11
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -80,7 +80,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -952,6 +952,7 @@ $ http &#39;http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&am
<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ILRI?src=hash&amp;ref_src=twsrc%5Etfw">#ILRI</a> research: Towards unlocking the potential of the hides and skins value chain in Somaliland <a href="https://t.co/EZH7ALW4dp">https://t.co/EZH7ALW4dp</a></p>&mdash; ILRI.org (@ILRI) <a href="https://twitter.com/ILRI/status/1086330519904673793?ref_src=twsrc%5Etfw">January 18, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<ul>
<li>The shortened link is <a href="goo.gl/fb/VRj9Gq">goo.gl/fb/VRj9Gq</a> and it shows a &ldquo;Dynamic Link not found&rdquo; error from Firebase:</li>
</ul>
@@ -1264,15 +1265,15 @@ identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInter
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -72,7 +72,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -102,7 +102,7 @@ sys 0m1.979s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1344,15 +1344,15 @@ Please see the DSpace documentation for assistance.
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -76,7 +76,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1208,15 +1208,15 @@ sys 0m2.551s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -94,7 +94,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1299,15 +1299,15 @@ UPDATE 14
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -48,7 +48,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -78,7 +78,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -631,15 +631,15 @@ COPY 64871
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -317,15 +317,15 @@ UPDATE 2
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -21,7 +21,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-07/" />
<meta property="article:published_time" content="2019-07-01T12:13:51+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta property="article:modified_time" content="2023-08-14T10:39:08+02:00" />
@@ -38,7 +38,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -50,7 +50,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
"url": "https://alanorth.github.io/cgspace-notes/2019-07/",
"wordCount": "2330",
"datePublished": "2019-07-01T12:13:51+03:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"dateModified": "2023-08-14T10:39:08+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -68,7 +68,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -330,7 +330,7 @@ dc.identifier.issn
<li>Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going</li>
</ul>
</li>
<li>Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: &ldquo;Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.&rdquo;</li>
<li>Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: &ldquo;Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.&rdquo;</li>
<li>I looked in the DSpace logs and found this right around the time of the screenshot he sent me:</li>
</ul>
<pre tabindex="0"><code>2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
@@ -554,15 +554,15 @@ issn.validate(&#39;1020-3362&#39;)
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s luck
Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -76,7 +76,7 @@ Run system updates on DSpace Test (linode19) and reboot it
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -573,15 +573,15 @@ sys 2m27.496s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -102,7 +102,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -581,15 +581,15 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -48,7 +48,7 @@
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -385,15 +385,15 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -58,7 +58,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -88,7 +88,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -692,15 +692,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -76,7 +76,7 @@ Make sure all packages are up to date and the package manager is up to date, the
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -404,15 +404,15 @@ UPDATE 1
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -86,7 +86,7 @@ I tweeted the CGSpace repository link
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -604,15 +604,15 @@ COPY 2900
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -68,7 +68,7 @@ The code finally builds and runs with a fresh install
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1275,15 +1275,15 @@ Moving: 21993 into core statistics-2019
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -72,7 +72,7 @@ You need to download this into the DSpace 6.x source and compile it
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -484,15 +484,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week
On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -78,7 +78,7 @@ On the same note, the one item Abenet pointed out last week now has a donut with
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -658,15 +658,15 @@ $ psql -c &#39;select * from pg_stat_activity&#39; | wc -l
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -477,15 +477,15 @@ Caused by: java.lang.NullPointerException
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
In other news, I checked the statistics API on DSpace 6 and it&rsquo;s working
I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Tes
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -811,15 +811,15 @@ $ csvcut -c &#39;id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]&#
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -68,7 +68,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1142,15 +1142,15 @@ Fixed 4 occurences of: Muloi, D.M.
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ It is class based so I can easily add support for other vocabularies, and the te
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -798,15 +798,15 @@ $ grep -c added /tmp/2020-08-27-countrycodetagger.log
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -78,7 +78,7 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -717,15 +717,15 @@ solr_query_params = {
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -74,7 +74,7 @@ During the FlywayDB migration I got an error:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1241,15 +1241,15 @@ $ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -32,7 +32,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -62,7 +62,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -731,15 +731,15 @@ $ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspa
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ I started processing those (about 411,000 records):
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -869,15 +869,15 @@ $ query-json &#39;.items | length&#39; /tmp/policy2.json
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -80,7 +80,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -688,15 +688,15 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -60,7 +60,7 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
}
}
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -90,7 +90,7 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -898,15 +898,15 @@ dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -64,7 +64,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -875,15 +875,15 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -74,7 +74,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -1042,15 +1042,15 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.133.1">
@@ -66,7 +66,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -685,15 +685,15 @@ May 26, 02:57 UTC
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

Some files were not shown because too many files have changed in this diff.