<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "July, 2021" / >
< meta property = "og:description" content = "2021-07-01
Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63=> \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2021-06/" / >
< meta property = "article:published_time" content = "2021-06-01T08:53:07+03:00" / >
< meta property = "article:modified_time" content = "2021-06-01T08:53:07+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "July, 2021" / >
< meta name = "twitter:description" content = "2021-07-01
Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63=> \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
"/>
< meta name = "generator" content = "Hugo 0.85.0" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "July, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-06/",
"wordCount": "873",
"datePublished": "2021-06-01T08:53:07+03:00",
"dateModified": "2021-06-01T08:53:07+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2021-06/" >
< title > July, 2021 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel = "stylesheet" integrity = "sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity = "sha256-/7/qCIqaFmbsZcOoy0kG4qDk+S3HDbv0AKElrSQiEjo=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2021-06/" > July, 2021< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2021-06-01T08:53:07+03:00" > Tue Jun 01, 2021< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/cgspace-notes/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2021-07-01" > 2021-07-01< / h2 >
< ul >
< li > Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > localhost/dspace63=> \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
< / code > < / pre > < h2 id = "2021-07-04" > 2021-07-04< / h2 >
< ul >
< li > Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > $ cd OpenRXV
$ docker-compose -f docker/docker-compose.yml down
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml build
< / code > < / pre > < ul >
< li > Then run all system updates and reboot the server< / li >
< li > After the server came back up I cloned the < code > openrxv-items-final< / code > index to < code > openrxv-items-temp< / code > and started the plugins
< ul >
< li > This will hopefully be faster than a full re-harvest… < / li >
< / ul >
< / li >
< li > I opened a few GitHub issues for OpenRXV bugs:
< ul >
< li > < a href = "https://github.com/ilri/OpenRXV/issues/103" > Hide “metadata structure” section in repository setup< / a > < / li >
< li > < a href = "https://github.com/ilri/OpenRXV/issues/104" > Improve “start plugins” and “commit indexing” buttons< / a > < / li >
< li > < a href = "https://github.com/ilri/OpenRXV/issues/105" > Allow running plugins individually< / a > < / li >
< li > < a href = "https://github.com/ilri/OpenRXV/issues/106" > Hide the “DSpace add missing items”< / a > < / li >
< / ul >
< / li >
< li > Rebuild DSpace Test (linode26) from a fresh Ubuntu 20.04 image on Linode< / li >
< li > The “start plugins” run on AReS had seventy-five errors from the < code > dspace_add_missing_items< / code > plugin for some reason, so I had to start a fresh indexing< / li >
< li > I noticed that the WorldFish data has dozens of incorrect countries, so I should talk to Salem about that because they manage it
< ul >
< li > Also I noticed that we weren’t using the Country formatter in OpenRXV for the WorldFish country field, so some values don’t get mapped properly< / li >
< li > I added some value mappings to fix some issues with WorldFish data, added a few more fields to the repository harvesting config, and started a fresh re-indexing< / li >
< / ul >
< / li >
< / ul >
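< p > For reference, the image list that the < code > docker images | grep | sed | cut< / code > pipeline above feeds to < code > docker pull< / code > can be sketched in Python. This is only an illustration, not what I ran on the server: the function name < code > image_refs< / code > and the sample listing are hypothetical, and it assumes the usual whitespace-separated < code > docker images< / code > table with a header row starting with REPOSITORY.< / p >

```python
def image_refs(docker_images_output):
    """Return "repository:tag" strings from `docker images` style text.

    Hypothetical helper for illustration; mirrors the shell pipeline's
    header filter (grep -v ^REPO) and its first-two-columns selection.
    """
    refs = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        # Skip blank or malformed rows and the header row
        if len(fields) in (0, 1) or fields[0].startswith("REPO"):
            continue
        repository, tag = fields[0], fields[1]
        refs.append(f"{repository}:{tag}")
    return refs


# Made-up sample listing, not real output from linode20
sample = """REPOSITORY   TAG    IMAGE ID       CREATED       SIZE
nginx        1.21   4cdc5dd7eaad   2 weeks ago   133MB
redis        6      fad0ee7e917a   3 weeks ago   105MB
"""

print(image_refs(sample))
```

< p > Splitting on whitespace rather than on colons means a repository name that itself contains a colon (for example a registry with a port) stays intact, whereas < code > cut -d: -f1,2< / code > would truncate it.< / p >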
< h2 id = "2021-07-05" > 2021-07-05< / h2 >
< ul >
< li > The AReS harvesting last night succeeded and I started the plugins< / li >
< li > Margarita from CCAFS asked if we can create a new field for AICCRA publications
< ul >
< li > I asked her to clarify what they want< / li >
< li > AICCRA is an initiative so it might be better to create a new field for that, for example < code > cg.contributor.initiative< / code > < / li >
< / ul >
< / li >
< / ul >
< h2 id = "2021-07-06" > 2021-07-06< / h2 >
< ul >
< li > Atmire merged my spider user agent changes from last month so I will update the < code > example< / code > list we use in DSpace and remove the new ones from my < code > ilri< / code > override file
< ul >
< li > Also, I concatenated all our user agents into one file and purged all hits:< / li >
< / ul >
< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > $ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
Purging 95 hits from Drupal in statistics
Purging 38 hits from DTS Agent in statistics
Purging 601 hits from Microsoft Office Existence Discovery in statistics
Purging 51 hits from Site24x7 in statistics
Purging 62 hits from Trello in statistics
Purging 13574 hits from WhatsApp in statistics
Purging 144 hits from FlipboardProxy in statistics
Purging 37 hits from LinkWalker in statistics
Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
Purging 427 hits from WordPress in statistics
Total number of bot hits purged: 15030
< / code > < / pre > < ul >
< li > Meet with the CGIAR–AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC< / li >
< li > I extracted another list of all subjects to check against AGROVOC:< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > \COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
< / code > < / pre > < ul >
< li > Test < a href = "https://github.com/DSpace/DSpace/pull/3162" > Hrafn Malmquist’s proposed DBCP2 changes< / a > for DSpace 6.4 (DS-4574)
< ul >
< li > His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7’s JDBC pooling via JNDI< / li >
< li > Tomcat 8 has DBCP2 built in, but we are stuck on Tomcat 7 for now< / li >
< / ul >
< / li >
< li > Looking into the database issues we had last month on 2021-06-23
< ul >
< li > I think it might have been some kind of attack because the number of XMLUI sessions was through the roof at one point (10,000!) and the number of unique IPs accessing the server that day is much higher than any other day:< / li >
< / ul >
< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
10693
2021-06-11
10587
2021-06-12
7958
2021-06-13
7681
2021-06-14
12639
2021-06-15
15388
2021-06-16
12245
2021-06-17
11187
2021-06-18
9684
2021-06-19
7835
2021-06-20
7198
2021-06-21
10380
2021-06-22
10255
2021-06-23
15878
2021-06-24
9963
2021-06-25
9439
2021-06-26
7930
< / code > < / pre > < ul >
< li > The number of unique IPs connecting to the REST API, by contrast, was close to the average for the preceding weeks:< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
1183
2021-06-11
1074
2021-06-12
911
2021-06-13
892
2021-06-14
1320
2021-06-15
1257
2021-06-16
1208
2021-06-17
1119
2021-06-18
965
2021-06-19
985
2021-06-20
854
2021-06-21
1098
2021-06-22
1028
2021-06-23
1375
2021-06-24
1135
2021-06-25
969
2021-06-26
904
< / code > < / pre > < ul >
< li > According to goaccess, the traffic spike started at 2AM (remember that the first “Pool empty” error in dspace.log was at 4:01AM):< / li >
< / ul >
< pre > < code class = "language-console" data-lang = "console" > # zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
< / code > < / pre > < ul >
< li > Moayad sent a fix for the add missing items plugins issue (< a href = "https://github.com/ilri/OpenRXV/pull/107" > #107< / a > )
< ul >
< li > It works MUCH faster because it correctly identifies the missing handles in each repository< / li >
< li > Also it adds better debug messages to the API logs< / li >
< / ul >
< / li >
< / ul >
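< p > The per-day unique IP counts in the shell loops above can also be sketched in Python. This is a minimal illustration, assuming the nginx logs use the combined format (client IP first, timestamp in the fourth whitespace-separated field); the function name < code > unique_ips_per_day< / code > and the sample lines are hypothetical, not taken from the CGSpace logs. Unlike the < code > grep< / code > in the loops, it matches the date only in the timestamp field rather than anywhere in the line.< / p >

```python
from collections import defaultdict
from datetime import datetime


def unique_ips_per_day(lines):
    """Map each ISO date to the number of distinct client IPs seen that day."""
    ips_by_day = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Combined format: IP, ident, user, then "[day/Mon/year:time"
        if len(fields) >= 4 and fields[3].startswith("["):
            day = fields[3][1:].split(":", 1)[0]
            iso_day = datetime.strptime(day, "%d/%b/%Y").date().isoformat()
            ips_by_day[iso_day].add(fields[0])
    return {day: len(ips) for day, ips in sorted(ips_by_day.items())}


# Made-up sample lines using documentation-reserved IP addresses
sample = [
    '192.0.2.1 - - [23/Jun/2021:02:00:01 +0200] "GET / HTTP/1.1" 200 512',
    '192.0.2.1 - - [23/Jun/2021:02:00:05 +0200] "GET /a HTTP/1.1" 200 512',
    '192.0.2.2 - - [23/Jun/2021:04:01:00 +0200] "GET / HTTP/1.1" 200 512',
    '192.0.2.3 - - [24/Jun/2021:09:30:00 +0200] "GET / HTTP/1.1" 200 512',
]

print(unique_ips_per_day(sample))
```

< p > To run it against the rotated logs, the lines could be read with < code > gzip.open(path, "rt")< / code > for each file and chained together before being passed in.< / p >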
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2021-06/" > June, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-07/" > July, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-05/" > May, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-04/" > April, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-03/" > March, 2021< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >