<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2023" />
<meta property="og:description" content="2023-01-01
Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
I want to update all ORCID names and refresh them in the database
I see we have some new ones that aren&rsquo;t in our list if I combine with this file:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-01/" />
<meta property="article:published_time" content="2023-01-01T08:44:36+03:00" />
<meta property="article:modified_time" content="2023-03-14T14:30:17+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2023"/>
<meta name="twitter:description" content="2023-01-01
Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
I want to update all ORCID names and refresh them in the database
I see we have some new ones that aren&rsquo;t in our list if I combine with this file:
"/>
<meta name="generator" content="Hugo 0.111.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-01/",
"wordCount": "4367",
"datePublished": "2023-01-01T08:44:36+03:00",
"dateModified": "2023-03-14T14:30:17+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2023-01/">
<title>January, 2023 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2023-01/">January, 2023</a></h2>
<p class="blog-post-meta">
<time datetime="2023-01-01T08:44:36+03:00">Sun Jan 01, 2023</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2023-01-01">2023-01-01</h2>
<ul>
<li>Apply some more ORCID identifiers to items on CGSpace using my <code>2022-09-22-add-orcids.csv</code> file
<ul>
<li>I want to update all ORCID names and refresh them in the database</li>
<li>I see we have some new ones that aren&rsquo;t in our list if I combine with this file:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>1939
</span></span><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>1973
</span></span></code></pre></div><ul>
<li>I will extract and process them with my <code>resolve-orcids.py</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u &gt; /tmp/2023-01-01-orcids.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2023-01-01-orcids.txt -o /tmp/2023-01-01-orcids-names.txt -d
</span></span></code></pre></div><ul>
<li>Then update them in the database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/update-orcids.py -i /tmp/2023-01-01-orcids-names.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -m <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ul>
<li>Load on CGSpace is high, around 9.x
<ul>
<li>I see there is a CIAT bot harvesting via the REST API with IP 45.5.186.2</li>
<li>Other than that I don&rsquo;t see any particular system stats as alarming</li>
<li>There has been a marked increase in load in the last few weeks, perhaps due to Initiative activity&hellip;</li>
<li>Perhaps there are some stuck PostgreSQL locks from CLI tools?</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | grep -o -E <span style="color:#e6db74">&#39;(dspaceWeb|dspaceApi|dspaceCli)&#39;</span> | sort | uniq -c
</span></span><span style="display:flex;"><span> 58 dspaceCli
</span></span><span style="display:flex;"><span> 46 dspaceWeb
</span></span></code></pre></div><ul>
<li>The current time on the server is 08:52 and I see the dspaceCli locks were started at 04:00 and 05:00&hellip; so I need to check which cron jobs those belong to as I think I noticed this last month too
<ul>
<li>I&rsquo;m going to wait and see if they finish, but by tomorrow I will kill them</li>
</ul>
</li>
</ul>
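<ul>
<li>A minimal sketch of how the lock check above could also report when the oldest query per pool started, which would help match long-held locks to a cron job. The function names are hypothetical and not part of the scripts used in these notes; <code>application_name</code> and <code>query_start</code> are standard <code>pg_stat_activity</code> columns:</li>
</ul>

```shell
#!/bin/sh
# Count lock rows per DSpace connection pool name in psql output read from
# stdin. This is the same grep/sort/uniq pipeline used in the notes.
count_lock_apps() {
    grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
}

# Print a query that groups locks by pool name and shows the oldest query
# start time per pool. The function name is hypothetical (an assumption for
# illustration); the columns are standard PostgreSQL catalog columns.
lock_summary_sql() {
    printf '%s\n' \
        'SELECT psa.application_name, count(*) AS locks,' \
        '       min(psa.query_start) AS oldest_query_start' \
        'FROM pg_locks pl' \
        'LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid' \
        'GROUP BY psa.application_name' \
        'ORDER BY locks DESC;'
}

# Usage, left commented out because it needs database access:
# psql -c "$(lock_summary_sql)"
```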
<h2 id="2023-01-02">2023-01-02</h2>
<ul>
<li>The load on the server is now very low and there are no more locks from dspaceCli
<ul>
<li>So there <em>was</em> some long-running process that had to finish!</li>
<li>That finally sheds some light on the &ldquo;high load on Sunday&rdquo; problem where I couldn&rsquo;t find any other distinct pattern in the nginx or Tomcat requests</li>
</ul>
</li>
</ul>
<h2 id="2023-01-03">2023-01-03</h2>
<ul>
<li>The load on the server on Sundays, which I have noticed for a long time, seems to be coming from the DSpace checker cron job
<ul>
<li>This checks the checksums of all bitstreams to see if they match the ones in the database</li>
</ul>
</li>
<li>I exported the entire CGSpace metadata to do country/region checks with <code>csv-metadata-quality</code>
<ul>
<li>I extracted only the items with countries, which was about 48,000, then split the file into parts of 10,000 items, but the upload found 2,000 changes in the first one and took several hours to complete&hellip;</li>
</ul>
</li>
<li>IWMI sent me ORCID identifiers for new scientists, bringing our total to 2,010</li>
</ul>
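<ul>
<li>A minimal sketch of splitting a metadata export into parts of 10,000 items while repeating the header row in each part. The function name and output naming are hypothetical, and it assumes no quoted field contains an embedded newline (multi-line abstracts would need a CSV-aware tool instead):</li>
</ul>

```shell
#!/bin/sh
# Split a CSV into numbered parts with at most $2 data rows each, copying
# the header line into every part.
# $1: input CSV, $2: rows per part, $3: output prefix (hypothetical naming)
# Assumption: one record per line, i.e. no embedded newlines in fields.
split_csv() {
    awk -v rows="$2" -v prefix="$3" '
        NR == 1 { header = $0; next }
        (NR - 2) % rows == 0 {
            if (out != "") close(out)
            out = sprintf("%s-%02d.csv", prefix, ++part)
            print header > out
        }
        { print > out }
    ' "$1"
}

# Usage (file names are placeholders):
# split_csv /tmp/countries.csv 10000 /tmp/countries-part
```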
<h2 id="2023-01-04">2023-01-04</h2>
<ul>
<li>I finally finished applying the region imports (in five batches of 10,000)
<ul>
<li>There were about 7,500 missing regions in total&hellip;</li>
</ul>
</li>
<li>Now I will move on to doing the Initiative mappings
<ul>
<li>I modified my <code>fix-initiative-mappings.py</code> script to only write out the items that have updated mappings</li>
<li>This makes it way easier to apply fixes to the entire CGSpace because we don&rsquo;t try to import 100,000 items with no changes in mappings</li>
</ul>
</li>
<li>More dspaceCli locks from 04:00 this morning (current time on server is 07:33) and today is a Wednesday
<ul>
<li>The checker cron job runs on <code>0,3</code>, which is Sunday and Wednesday, so this is from that&hellip;</li>
<li>Finally at 16:30 I decided to kill the PIDs associated with those locks&hellip;</li>
<li>I am going to disable that cron job for now and watch the server load for a few weeks</li>
</ul>
</li>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2023-01-08">2023-01-08</h2>
<ul>
<li>It&rsquo;s Sunday and I see some PostgreSQL locks belonging to dspaceCli that started at 05:00
<ul>
<li>That&rsquo;s strange because I disabled the <code>dspace checker</code> one last week, so I&rsquo;m not sure which this is&hellip;</li>
<li>It&rsquo;s currently 14:30 on the server so these locks have been there for more than nine hours</li>
</ul>
</li>
<li>I exported the entire CGSpace to update the Initiative mappings
<ul>
<li>Items were mapped to ~58 new Initiative collections</li>
</ul>
</li>
<li>Then I ran the ORCID import to catch any new ones that might not have been tagged</li>
<li>Then I started a harvest on AReS</li>
</ul>
<h2 id="2023-01-09">2023-01-09</h2>
<ul>
<li>Fix some invalid Initiative names on CGSpace and then check for missing mappings</li>
<li>Check for missing regions in the Initiatives collection</li>
<li>Export a list of author affiliations from the Initiatives community for Peter to check
<ul>
<li>It was slightly hacky because I did it from a CSV export of the Initiatives community, then imported to OpenRefine to split multi-value fields, then did some sed nonsense to handle the quoting:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.contributor.affiliation[en_US]&#39;</span> ~/Downloads/2023-01-09-initiatives.csv | <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> sed -e &#39;s/^&#34;//&#39; -e &#39;s/&#34;$//&#39; -e &#39;s/||/\n/g&#39; | \
</span></span><span style="display:flex;"><span> sort -u | \
</span></span><span style="display:flex;"><span> sed -e &#39;s/^\(.*\)/&#34;\1/&#39; -e &#39;s/\(.*\)$/\1&#34;/&#39; &gt; /tmp/2023-01-09-initiatives-affiliations.csv
</span></span></code></pre></div><h2 id="2023-01-10">2023-01-10</h2>
<ul>
<li>Export the CGSpace Initiatives collection to check for missing regions and collection mappings</li>
</ul>
<h2 id="2023-01-11">2023-01-11</h2>
<ul>
<li>I&rsquo;m trying the DSpace 7 REST API again
<ul>
<li>While following the <a href="https://github.com/DSpace/RestContract/blob/main/authentication.md">DSpace 7 REST API authentication docs</a> I still cannot log in via curl on the command line because I get an <code>Access is denied. Invalid CSRF token.</code> message</li>
<li>Logging in via the HAL Browser works&hellip;</li>
<li>Someone on the DSpace Slack mentioned that the <a href="https://github.com/DSpace/RestContract/issues/209">authentication documentation is out of date</a> and we need to specify the cookie too</li>
<li>I tried it and finally got it to work:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl --head https://dspace7test.ilri.org/server/api
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>set-cookie: DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519; Path=/server; Secure; HttpOnly; SameSite=None
</span></span><span style="display:flex;"><span>dspace-xsrf-token: 42c78c56-613d-464f-89ea-79142fc5b519
</span></span><span style="display:flex;"><span>$ curl -v -X POST https://dspace7test.ilri.org/server/api/authn/login --data <span style="color:#e6db74">&#34;user=alantest%40cgiar.org&amp;password=dspace&#34;</span> -H <span style="color:#e6db74">&#34;X-XSRF-TOKEN: 42c78c56-613d-464f-89ea-79142fc5b519&#34;</span> -b <span style="color:#e6db74">&#34;DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519&#34;</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>authorization: Bearer eyJh...9-0
</span></span><span style="display:flex;"><span>$ curl -v <span style="color:#e6db74">&#34;https://dspace7test.ilri.org/server/api/core/items&#34;</span> -H <span style="color:#e6db74">&#34;Authorization: Bearer eyJh...9-0&#34;</span>
</span></span></code></pre></div><ul>
<li>I created <a href="https://github.com/DSpace/RestContract/pull/213">a pull request</a> to fix the docs</li>
<li>I did quite a lot of cleanup and updates on the IFPRI batch items for the Gender Equality batch upload
<ul>
<li>Then I uploaded them to CGSpace</li>
</ul>
</li>
<li>I added about twenty more ORCID identifiers to my list and tagged them on CGSpace</li>
</ul>
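<ul>
<li>A minimal sketch of the DSpace 7 login flow above as shell functions, so the CSRF token does not have to be copied by hand. The function names are hypothetical; the header and cookie names are the ones shown in the curl output above:</li>
</ul>

```shell
#!/bin/sh
# Pull the CSRF token out of HTTP response headers read from stdin.
# Header names are matched case-insensitively and CRs are stripped.
extract_xsrf_token() {
    tr -d '\r' | awk 'tolower($1) == "dspace-xsrf-token:" { print $2; exit }'
}

# Fetch a token, then log in sending it as both header and cookie.
# Prints the response headers, which include the "authorization: Bearer" line.
# $1: REST API base URL, $2: URL-encoded email, $3: password
dspace_login() {
    token=$(curl -s --head "$1" | extract_xsrf_token)
    curl -s -o /dev/null -D - -X POST "$1/authn/login" \
        --data "user=$2&password=$3" \
        -H "X-XSRF-TOKEN: $token" \
        -b "DSPACE-XSRF-COOKIE=$token"
}

# Usage, left commented out because it needs network access:
# dspace_login https://dspace7test.ilri.org/server/api 'alantest%40cgiar.org' 'dspace'
```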
<h2 id="2023-01-12">2023-01-12</h2>
<ul>
<li>I exported the entire CGSpace and did some cleanups on all metadata in OpenRefine
<ul>
<li>I was primarily interested in normalizing the DOIs, but I also normalized a bunch of publishing places</li>
<li>After this import I will export it again to do the Initiative and region mappings</li>
<li>I ran the <code>fix-initiative-mappings.py</code> script and got forty-nine new mappings&hellip;</li>
</ul>
</li>
<li>I added several dozen new ORCID identifiers to my list and tagged ~500 on CGSpace</li>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2023-01-13">2023-01-13</h2>
<ul>
<li>Do a bit more cleanup on licenses, issue dates, and publishers
<ul>
<li>Then I started importing my large list of 5,000 items changed from yesterday</li>
</ul>
</li>
<li>Help Karen add abstracts to a bunch of SAPLING items that were missing them on CGSpace
<ul>
<li>For now I only did open access journal articles, but I should do the reports and others too</li>
</ul>
</li>
</ul>
<h2 id="2023-01-14">2023-01-14</h2>
<ul>
<li>Export CGSpace and check for missing Initiative mappings
<ul>
<li>There were a total of twenty-five</li>
<li>Then I exported the Initiatives community to check the countries and regions</li>
</ul>
</li>
</ul>
<h2 id="2023-01-15">2023-01-15</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2023-01-16">2023-01-16</h2>
<ul>
<li>Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems</li>
<li>Batch import another twenty-eight items for IFPRI across several Initiatives
<ul>
<li>On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc</li>
<li>I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts</li>
<li>Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values</li>
</ul>
</li>
</ul>
<h2 id="2023-01-17">2023-01-17</h2>
<ul>
<li>Batch import another twenty-three items for IFPRI across several Initiatives
<ul>
<li>I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc</li>
<li>I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts</li>
<li>Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669</li>
<li>Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values</li>
</ul>
</li>
<li>I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality</li>
<li>I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace</li>
<li>There is a high load on CGSpace pretty regularly
<ul>
<li>Munin shows a marked increase in DSpace sessions over the last few weeks:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2023/01/jmx_dspace_sessions-year.png" alt="DSpace sessions year"></p>
<ul>
<li>Is this attributable to all the PRMS harvesting?</li>
<li>I also see some PostgreSQL locks starting earlier today:</li>
</ul>
<p><img src="/cgspace-notes/2023/01/postgres_connections_ALL-day.png" alt="PostgreSQL locks day"></p>
<ul>
<li>I&rsquo;m curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/<span style="color:#f92672">{</span>rest,access,library-access,oai<span style="color:#f92672">}</span>.log /var/log/nginx/<span style="color:#f92672">{</span>rest,access,library-access,oai<span style="color:#f92672">}</span>.log.1 /var/log/nginx/<span style="color:#f92672">{</span>rest,access,library-access,oai<span style="color:#f92672">}</span>.log.<span style="color:#f92672">{</span>2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25<span style="color:#f92672">}</span>.gz | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/2023-01-17-cgspace-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/2023-01-17-cgspace-ips.txt
</span></span><span style="display:flex;"><span>129446 /tmp/2023-01-17-cgspace-ips.txt
</span></span></code></pre></div><ul>
<li>I ran the IPs through my <code>resolve-addresses-geoip2.py</code> script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$&#39;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
</span></span><span style="display:flex;"><span> sed 1d | sort | uniq &gt; /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>776 /tmp/networks-to-block.txt
</span></span></code></pre></div><ul>
<li>I added the list of networks to nginx&rsquo;s <code>bot-networks.conf</code> so they will all be heavily rate limited</li>
<li>Looking at the Munin stats again I see the load has been extra high since yesterday morning:</li>
</ul>
<p><img src="/cgspace-notes/2023/01/cpu-week.png" alt="CPU week"></p>
<ul>
<li>But still, it&rsquo;s suspicious that there are so many PostgreSQL locks</li>
<li>Looking at the Solr stats to check the hits for the last month (actually I skipped December because I was so busy)
<ul>
<li>I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)</li>
<li>I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamlytics, but their user agent is all lower case and it&rsquo;s a data center ISP so nope</li>
<li>I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked</li>
<li>I see 79.110.73.54 is on ALFA TELECOM / Serverel and is using a different, weird user agent with each request</li>
<li>There are too many to count&hellip; so I will purge these and then move on to user agents</li>
</ul>
</li>
<li>I purged hits from those IPs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 439185 hits from 31.148.223.10 in statistics
</span></span><span style="display:flex;"><span>Purging 2151 hits from 18.203.245.60 in statistics
</span></span><span style="display:flex;"><span>Purging 1990 hits from 3.249.192.212 in statistics
</span></span><span style="display:flex;"><span>Purging 1975 hits from 34.244.160.145 in statistics
</span></span><span style="display:flex;"><span>Purging 1969 hits from 52.213.59.101 in statistics
</span></span><span style="display:flex;"><span>Purging 2540 hits from 91.209.8.29 in statistics
</span></span><span style="display:flex;"><span>Purging 1624 hits from 54.78.176.127 in statistics
</span></span><span style="display:flex;"><span>Purging 1236 hits from 54.74.197.53 in statistics
</span></span><span style="display:flex;"><span>Purging 1327 hits from 54.246.128.111 in statistics
</span></span><span style="display:flex;"><span>Purging 1108 hits from 52.16.103.133 in statistics
</span></span><span style="display:flex;"><span>Purging 1045 hits from 63.32.99.252 in statistics
</span></span><span style="display:flex;"><span>Purging 999 hits from 176.34.141.181 in statistics
</span></span><span style="display:flex;"><span>Purging 997 hits from 34.243.17.80 in statistics
</span></span><span style="display:flex;"><span>Purging 985 hits from 34.240.206.16 in statistics
</span></span><span style="display:flex;"><span>Purging 862 hits from 18.203.81.120 in statistics
</span></span><span style="display:flex;"><span>Purging 1654 hits from 176.97.210.106 in statistics
</span></span><span style="display:flex;"><span>Purging 1628 hits from 51.81.193.200 in statistics
</span></span><span style="display:flex;"><span>Purging 1020 hits from 79.110.73.54 in statistics
</span></span><span style="display:flex;"><span>Purging 842 hits from 35.153.105.213 in statistics
</span></span><span style="display:flex;"><span>Purging 1689 hits from 54.164.237.125 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 466826
</span></span></code></pre></div><ul>
<li>Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
<ul>
<li><code>azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li><code>Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)</code></li>
<li><code>crownpeak</code></li>
<li><code>Mozilla/5.0 (compatible)</code></li>
</ul>
</li>
<li>Also, a ton of them are lower case, which I&rsquo;ve never seen before&hellip; it might be possible, but looks super fishy to me:
<ul>
<li><code>mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0</code></li>
<li><code>mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0</code></li>
<li><code>mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36</code></li>
<li><code>mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36</code></li>
</ul>
</li>
<li>I purged some of those:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
</span></span><span style="display:flex;"><span>Purging 1658 hits from azure-logic-apps\/1.0 in statistics
</span></span><span style="display:flex;"><span>Purging 948 hits from Gov employment data scraper in statistics
</span></span><span style="display:flex;"><span>Purging 786 hits from Microsoft\.Data\.Mashup in statistics
</span></span><span style="display:flex;"><span>Purging 303 hits from crownpeak in statistics
</span></span><span style="display:flex;"><span>Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 4027
</span></span></code></pre></div><ul>
<li>Then I ran all system updates on the server and rebooted it
<ul>
<li>Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers</li>
<li>I need to re-work how I&rsquo;m doing this whitelisting and blacklisting&hellip; it&rsquo;s way too complicated now</li>
</ul>
</li>
<li>Export entire CGSpace to check Initiative mappings, and add nineteen&hellip;</li>
<li>Start a harvest on AReS</li>
</ul>
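<ul>
<li>A minimal sketch for ranking client IPs by request count, as a complement to the unique-IP list built from the nginx logs above. The function name is hypothetical, and it assumes the client address is the first whitespace-separated field of each log line, as in the <code>awk</code> command above:</li>
</ul>

```shell
#!/bin/sh
# Print "count ip" for the $1 busiest client addresses found in access-log
# lines read from stdin. Assumes the address is the first field.
top_ips() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "$1" | awk '{ print $1, $2 }'
}

# Usage, left commented out because it needs the server logs:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | top_ips 20
```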
<h2 id="2023-01-18">2023-01-18</h2>
<ul>
<li>I&rsquo;m looking at all the ORCID identifiers in the database, which seem to be way more than I realized:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
</span></span><span style="display:flex;"><span>COPY 4231
</span></span><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u &gt; /tmp/2023-01-18-orcids.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2023-01-18-orcids.txt
</span></span><span style="display:flex;"><span>4518 /tmp/2023-01-18-orcids.txt
</span></span></code></pre></div><ul>
<li>Then I resolved them from ORCID and updated them in the database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d
</span></span><span style="display:flex;"><span>$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -m <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ul>
<li>Then I updated the controlled vocabulary</li>
<li>CGSpace became unresponsive in the afternoon, with a high number of locks, but surprisingly low CPU usage:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | grep -o -E <span style="color:#e6db74">&#39;(dspaceWeb|dspaceApi|dspaceCli)&#39;</span> | sort | uniq -c
</span></span><span style="display:flex;"><span> 83 dspaceApi
</span></span><span style="display:flex;"><span> 7829 dspaceWeb
</span></span></code></pre></div><ul>
<li>In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7&hellip;
<ul>
<li>I hope this doesn&rsquo;t cause some issue with in-progress workflows&hellip;</li>
</ul>
</li>
<li>I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python
<ul>
<li>I will add python to the list of bad bot user agents in nginx</li>
</ul>
</li>
<li>While looking into the locks I see some potential Java heap issues
<ul>
<li>Indeed, I see two out of memory errors in Tomcat&rsquo;s journal:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
</span></span><span style="display:flex;"><span>tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
</span></span></code></pre></div><ul>
<li>Which explains why the locks went down to normal numbers as I was watching&hellip; (because Java crashed)</li>
</ul>
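<ul>
<li>As a rough sketch of what the <code>grep -o | sort | uniq -c</code> pipeline above computes, the same tally can be done in Python; the <code>count_pools</code> helper is hypothetical and only the three pool names come from the command above:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re
from collections import Counter

# Pool names matched by the grep pattern in the notes above.
POOL_PATTERN = re.compile(r"dspaceWeb|dspaceApi|dspaceCli")


def count_pools(psql_output):
    """Count every occurrence of a pool name, like grep -o | sort | uniq -c."""
    # count_pools is a hypothetical helper name, not part of any DSpace tooling.
    return Counter(POOL_PATTERN.findall(psql_output))
</code></pre></div>
<ul>
<li>Feeding it the text printed by <code>psql</code> would give one count per pool, which makes a lopsided <code>dspaceWeb</code> number easy to spot</li>
</ul>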
<h2 id="2023-01-19">2023-01-19</h2>
<ul>
<li>Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace</li>
<li>So it seems an IFPRI user got caught up in the blocking I did yesterday
<ul>
<li>Their ISP is Comcast&hellip;</li>
<li>I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/list/AS714 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> https://asn.ipinfo.app/api/text/list/AS16276 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS15169 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS23576 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS24940 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS13238 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS32934 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS14061 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS12876 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS55286 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS203020 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS204287 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS50245 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS6939 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS16509 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS14618
</span></span><span style="display:flex;"><span>$ cat AS* | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>18179
</span></span><span style="display:flex;"><span>$ cat /tmp/AS* | ~/go/bin/mapcidr -a &gt; /tmp/networks.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks.txt
</span></span><span style="display:flex;"><span>5872 /tmp/networks.txt
</span></span></code></pre></div><h2 id="2023-01-20">2023-01-20</h2>
<ul>
<li>A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives)</li>
<li>I noticed that MEL and CGSpace are using slightly different vocabularies for SDGs so I sent an email to Salem and Sara</li>
</ul>
<h2 id="2023-01-21">2023-01-21</h2>
<ul>
<li>Export the Initiatives community again to perform collection mappings and country/region fixes</li>
</ul>
<h2 id="2023-01-22">2023-01-22</h2>
<ul>
<li>There has been a high load on the server for a few days, currently 8.0&hellip; and I&rsquo;ve been seeing some PostgreSQL locks stuck all day:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | grep -o -E <span style="color:#e6db74">&#39;(dspaceWeb|dspaceApi|dspaceCli)&#39;</span> | sort | uniq -c
</span></span><span style="display:flex;"><span> 11 dspaceApi
</span></span><span style="display:flex;"><span> 28 dspaceCli
</span></span><span style="display:flex;"><span> 981 dspaceWeb
</span></span></code></pre></div><ul>
<li>Looking at the locks I see they are from this morning at 5:00 AM, which is the <code>dspace checker-email</code> script
<ul>
<li>Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this one too&hellip;</li>
<li>Then I killed the PIDs of the locks</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceCli&#39;;&#34;</span> | less -S
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>$ ps auxw | grep <span style="color:#ae81ff">18986</span>
</span></span><span style="display:flex;"><span>postgres 1429108 1.9 1.5 3359712 508148 ? Ss 05:00 13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT
</span></span></code></pre></div><ul>
<li>Also, I checked the age of the locks and killed anything over 1 day:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql &lt; locks-age.sql | grep days | less -S
</span></span></code></pre></div><ul>
<li>Then I ran all updates on the server and restarted it&hellip;</li>
<li>Salem responded to my question about the SDG mismatch between MEL and CGSpace
<ul>
<li>We agreed to use a version based on the text of <a href="http://metadata.un.org/sdg/?lang=en">this site</a></li>
</ul>
</li>
<li>Salem is having issues with some REST API submission / updates
<ul>
<li>I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test</li>
</ul>
</li>
<li>Clean and normalize fifty-eight IFPRI records for batch import to CGSpace
<ul>
<li>I did a duplicate check and found six, so that&rsquo;s good!</li>
</ul>
</li>
<li>I exported the entire CGSpace to check for missing Initiative mappings
<ul>
<li>Then I exported the Initiatives community to check for missing regions</li>
<li>Then I ran the script to check for missing ORCID identifiers</li>
<li>Then <em>finally</em>, I started a harvest on AReS</li>
</ul>
</li>
</ul>
<h2 id="2023-01-23">2023-01-23</h2>
<ul>
<li>Salem found that you can actually harvest everything in DSpace 7 using the <a href="https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&amp;size=100"><code>discover/browses</code> endpoint</a></li>
<li>Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
<ul>
<li>I noticed that we still have &ldquo;North America&rdquo; as a region, but according to UN M.49 that is the continent, which comprises &ldquo;Northern America&rdquo; the region, so I will update our controlled vocabularies and all existing entries</li>
<li>I imported changes to 1,800 items</li>
<li>When it finished five hours later I started a harvest on AReS</li>
</ul>
</li>
</ul>
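<ul>
<li>A minimal sketch of how the paged URLs for that <code>discover/browses</code> harvest could be built, assuming only the <code>page</code> and <code>size</code> parameters visible in the link above; the helper names are hypothetical and the response format is not shown in these notes:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from urllib.parse import urlencode

# Endpoint taken from the link in the notes above.
BROWSE_URL = "https://dspace7test.ilri.org/server/api/discover/browses/title/items"


def browse_page_url(page, size=100):
    """Build the URL for one page of the title browse (hypothetical helper)."""
    return BROWSE_URL + "?" + urlencode({"page": page, "size": size})


def browse_page_urls(first_page, page_count, size=100):
    """Build URLs for page_count consecutive pages starting at first_page."""
    last_page = first_page + page_count
    return [browse_page_url(page, size) for page in range(first_page, last_page)]
</code></pre></div>
<ul>
<li>The number of pages to request would have to come from the API response itself, so it is left as a parameter here</li>
</ul>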
<h2 id="2023-01-24">2023-01-24</h2>
<ul>
<li>Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI</li>
<li>Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
<ul>
<li>I also added &ldquo;CGIAR Trust Fund&rdquo; to all items with an Initiative in <code>cg.contributor.initiative</code></li>
</ul>
</li>
</ul>
<h2 id="2023-01-25">2023-01-25</h2>
<ul>
<li>Oh shit, the import last night ran for twelve hours and then died:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Error committing changes to database: could not execute statement
</span></span><span style="display:flex;"><span>Aborting most recent changes.
</span></span></code></pre></div><ul>
<li>I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes</li>
<li>Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted</li>
<li>Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
<ul>
<li>We looked on AReS and all the items are still there</li>
<li>I looked in the DSpace log and see around 2,000 messages like this:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
</span></span><span style="display:flex;"><span>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
</span></span><span style="display:flex;"><span> at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
</span></span><span style="display:flex;"><span> at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
</span></span><span style="display:flex;"><span> at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.dispatchEvents(Context.java:455)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.commit(Context.java:424)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.complete(Context.java:380)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>I filed a ticket with Atmire to ask them</li>
<li>For now I just did a light Discovery reindex (not the full one) and all the items appeared again</li>
<li>Submit an issue to MEL GitHub regarding the capitalization of CRPs: <a href="https://github.com/CodeObia/MEL/issues/11133">https://github.com/CodeObia/MEL/issues/11133</a>
<ul>
<li>I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with <a href="https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt">our current controlled vocabulary for CRPs</a> and he will update it in MEL.</li>
<li>On that note, Peter and Abenet and I realized that we still have an old field <code>cg.subject.crp</code> with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)</li>
<li>I exported this list of values to lowercase them and move them to <code>cg.contributor.crp</code></li>
<li>Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.subject.crp -t correct
</span></span><span style="display:flex;"><span>$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.subject.crp -t cg.contributor.crp
</span></span></code></pre></div><ul>
<li>After fixing and moving them all, I deleted the <code>cg.subject.crp</code> field from the metadata registry</li>
<li>I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> text_lang<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;en_US&#39;</span> <span style="color:#66d9ef">WHERE</span> dspace_object_id <span style="color:#66d9ef">IN</span> (<span style="color:#66d9ef">SELECT</span> uuid <span style="color:#66d9ef">FROM</span> item <span style="color:#66d9ef">WHERE</span> in_archive <span style="color:#66d9ef">AND</span> <span style="color:#66d9ef">NOT</span> withdrawn) <span style="color:#66d9ef">AND</span> (text_lang <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">OR</span> text_lang <span style="color:#66d9ef">IN</span> (<span style="color:#e6db74">&#39;en&#39;</span>, <span style="color:#e6db74">&#39;&#39;</span>));
</span></span></code></pre></div><ul>
<li>
<p>I tried that in a transaction and it hung, so I canceled it and rolled back</p>
</li>
<li>
<p>I see some PostgreSQL locks attributed to <code>dspaceApi</code> that were started at <code>2023-01-25 13:40:04.529087+01</code> and haven&rsquo;t changed since then (that&rsquo;s eight hours ago)</p>
<ul>
<li>I killed the pid&hellip;</li>
<li>I also saw some locks owned by <code>dspaceWeb</code> that were nine and four hours old, so I killed those too&hellip;</li>
<li>Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can&rsquo;t run the update on the text langs&hellip;</li>
</ul>
</li>
<li>
<p>Export entire CGSpace to do Initiative mappings again</p>
</li>
<li>
<p>Started a harvest on AReS</p>
</li>
</ul>
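<ul>
<li>SQL evaluates <code>AND</code> before <code>OR</code>, so how the <code>text_lang</code> conditions are grouped in an <code>UPDATE</code> like the one above decides whether the item restriction applies to every branch; a small sketch on a toy table, using SQLite as a stand-in for PostgreSQL (the table and column names here are made up for illustration):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import sqlite3

# Toy table standing in for metadatavalue joined to item (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mv (id INTEGER, archived INTEGER, text_lang TEXT)")
conn.executemany(
    "INSERT INTO mv VALUES (?, ?, ?)",
    [
        (1, 1, None),  # archived, no language
        (2, 1, "en"),  # archived, short language code
        (3, 0, "en"),  # not archived, short language code
        (4, 0, None),  # not archived, no language
    ],
)


def matching_ids(where_clause):
    """Return the ids selected by a WHERE clause on the toy table."""
    rows = conn.execute("SELECT id FROM mv WHERE " + where_clause + " ORDER BY id")
    return [row[0] for row in rows]


# Parsed as (archived AND text_lang IS NULL) OR text_lang IN (...)
ungrouped = matching_ids("archived = 1 AND text_lang IS NULL OR text_lang IN ('en', '')")
# With parentheses the archived restriction covers both language conditions
grouped = matching_ids("archived = 1 AND (text_lang IS NULL OR text_lang IN ('en', ''))")
print(ungrouped, grouped)
</code></pre></div>
<ul>
<li>The ungrouped form also picks up the non-archived row with <code>en</code>, which on a large table means touching far more rows than intended</li>
</ul>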
<h2 id="2023-01-26">2023-01-26</h2>
<ul>
<li>Export entire CGSpace to do some metadata cleanup on various fields
<ul>
<li>I also added &ldquo;CGIAR Trust Fund&rdquo; to all items in the Initiatives community</li>
</ul>
</li>
</ul>
<h2 id="2023-01-27">2023-01-27</h2>
<ul>
<li>Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting <em>everything</em> from PostgreSQL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.contributor.affiliation[en_US]&#39;</span> 2023-01-27-initiatives.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e 1d -e &#39;s/^&#34;//&#39; -e &#39;s/&#34;$//&#39; -e &#39;s/||/\n/g&#39; -e &#39;/^$/d&#39; \
</span></span><span style="display:flex;"><span> | sort | uniq -c | sort -h \
</span></span><span style="display:flex;"><span> | awk &#39;BEGIN { FS = &#34;^[[:space:]]+[[:digit:]]+[[:space:]]+&#34; } {print $2}&#39;\
</span></span><span style="display:flex;"><span> | sed -e &#39;1i cg.contributor.affiliation&#39; -e &#39;s/^\(.*\)$/&#34;\1&#34;/&#39; \
</span></span><span style="display:flex;"><span> &gt; /tmp/2023-01-27-initiatives-affiliations.csv
</span></span></code></pre></div><ul>
<li>The first sed command drops the CSV header line, strips the surrounding quotes, splits multiple values on &ldquo;||&rdquo;, and deletes empty lines</li>
<li>The awk command sets the field separator to the leading whitespace and count printed by <code>uniq -c</code>, so that the second &ldquo;field&rdquo; is the affiliation itself, ie:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> 309 International Center for Agricultural Research in the Dry Areas
</span></span><span style="display:flex;"><span> 412 International Livestock Research Institute
</span></span></code></pre></div><ul>
<li>The second sed command adds the CSV header and quotes back</li>
<li>I did the same for authors and donors and sent them to Peter to make corrections</li>
</ul>
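<ul>
<li>For comparison, the split-and-count logic of the pipeline above could be sketched in Python with only the standard library; the function names are hypothetical and this is not one of the existing <code>ilri</code> scripts:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import csv
import io
from collections import Counter


def count_values(csv_text, column):
    """Count the individual values in a multi-value column split on '||'."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for value in (row.get(column) or "").split("||"):
            if value:  # skip empty cells, like the sed '/^$/d' step
                counts[value] += 1
    return counts


def values_by_frequency(counts):
    """List values from least to most frequent, ties in alphabetical order."""
    ordered = sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))
    return [value for value, _ in ordered]
</code></pre></div>
<ul>
<li>Letting the <code>csv</code> module handle quoting avoids stripping and re-adding quotes by hand, at the cost of reading the whole export in Python</li>
</ul>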
<h2 id="2023-01-28">2023-01-28</h2>
<ul>
<li>Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API</li>
</ul>
<h2 id="2023-01-29">2023-01-29</h2>
<ul>
<li>Export the entire CGSpace to do Initiatives collection mappings</li>
<li>I was thinking about a way to use Crossref&rsquo;s API to enrich our data, for example checking registered DOIs for license information, publishers, etc
<ul>
<li>Turns out I had already written <code>crossref-doi-lookup.py</code> last year, and it works</li>
<li>I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren&rsquo;t registered on Crossref, which is about 11,800 DOIs</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.identifier.doi[en_US]&#39;</span> ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c &#39;cg.identifier.doi[en_US]&#39; -r &#39;.*cifor.*&#39; -i \
</span></span><span style="display:flex;"><span> | sed 1d &gt; /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>11819 /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.identifier.doi[en_US]&#39;</span> ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e &#39;s_https://doi.org/__g&#39; -e &#39;s_https://dx.doi.org/__g&#39; -e &#39;s/cg.identifier.doi\[en_US\]/doi/&#39; \
</span></span><span style="display:flex;"><span> &gt; /tmp/cgspace-temp.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c license -r &#39;creative&#39; \
</span></span><span style="display:flex;"><span> | sed &#39;1s/license/dcterms.license[en_US]/&#39; \
</span></span><span style="display:flex;"><span> | csvcut -c id,license &gt; /tmp/2023-01-29-new-licenses.csv
</span></span></code></pre></div><ul>
<li>The above was done with just 5,000 DOIs because it was taking a long time, but after the last step I imported into OpenRefine to clean up the license URLs
<ul>
<li>Then I imported 635 new licenses to CGSpace woooo</li>
<li>After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo</li>
</ul>
</li>
<li>Peter finished the corrections on affiliations, authors, and donors
<ul>
<li>I quickly checked them and applied each on CGSpace</li>
</ul>
</li>
<li>Start a harvest on AReS</li>
</ul>
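<ul>
<li>The join between the CGSpace export and the Crossref results above boils down to normalizing the DOIs and keeping rows whose license mentions &ldquo;creative&rdquo;; a minimal sketch of that logic, with hypothetical function names and in-memory data instead of the CSV files:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Resolver prefixes removed by the sed step in the notes above.
DOI_PREFIXES = ("https://doi.org/", "https://dx.doi.org/")


def bare_doi(value):
    """Strip the resolver prefix so DOIs from both files compare equal."""
    for prefix in DOI_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def new_licenses(items, crossref_licenses):
    """Pair item ids with Creative Commons license URLs found for their DOIs.

    items: iterable of (item id, DOI as stored on the item)
    crossref_licenses: dict mapping bare DOI to a license URL
    """
    result = []
    for item_id, doi in items:
        license_url = crossref_licenses.get(bare_doi(doi), "")
        if "creative" in license_url:  # same filter as csvgrep -r 'creative'
            result.append((item_id, license_url))
    return result
</code></pre></div>
<ul>
<li>This only approximates the <code>csvjoin</code> and <code>csvgrep</code> steps, and the license URLs would still need the same cleanup in OpenRefine</li>
</ul>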
<h2 id="2023-01-30">2023-01-30</h2>
<ul>
<li>Run the thumbnail fixer tasks on the Initiatives collections:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails 10568/115087 | tee -a /tmp/FixLowQualityThumbnails.log
</span></span><span style="display:flex;"><span>$ grep -c remove /tmp/FixLowQualityThumbnails.log
</span></span><span style="display:flex;"><span>16
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun io.github.ilri.cgspace.scripts.FixJpgJpgThumbnails 10568/115087 | tee -a /tmp/FixJpgJpgThumbnails.log
</span></span><span style="display:flex;"><span>$ grep -c replacing /tmp/FixJpgJpgThumbnails.log
</span></span><span style="display:flex;"><span>13
</span></span></code></pre></div><h2 id="2023-01-31">2023-01-31</h2>
<ul>
<li>Someone from the Google Scholar team contacted us to ask why Googlebot is blocked from crawling CGSpace
<ul>
<li>I said that I blocked them because they crawl haphazardly and we had high load during PRMS reporting</li>
<li>Now I will unblock their ASN (AS15169) in nginx&hellip;</li>
<li>I urged them to be smarter about crawling since we&rsquo;re a small team and they are a huge engineering company</li>
</ul>
</li>
<li>I removed their ASN and regenerated my list from 2023-01-17:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/list/AS714 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> https://asn.ipinfo.app/api/text/list/AS16276 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS23576 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS24940 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS13238 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS32934 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS14061 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS12876 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS55286 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS203020 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS204287 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS50245 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS6939 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS16509 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS14618
</span></span><span style="display:flex;"><span>$ cat AS* | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>17134
</span></span><span style="display:flex;"><span>$ cat /tmp/AS* | ~/go/bin/mapcidr -a &gt; /tmp/networks.txt
</span></span></code></pre></div><ul>
<li>Then I updated nginx&hellip;</li>
<li>Re-run the scripts to delete duplicate metadata values and update item timestamps that I originally used in 2022-11
<ul>
<li>This was about 650 duplicate metadata values&hellip;</li>
</ul>
</li>
<li>Exported CGSpace to do some metadata interrogation in OpenRefine
<ul>
<li>I looked at items that are set as <code>Limited Access</code> but have Creative Commons licenses</li>
<li>I filtered ~150 that had DOIs and checked them on the Crossref API using <code>crossref-doi-lookup.py</code></li>
<li>Of those, only about five or so were incorrectly marked as having Creative Commons licenses, so I set those to copyrighted</li>
<li>For the rest, I set them to Open Access</li>
</ul>
</li>
<li>Start a harvest on AReS</li>
</ul>
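<ul>
<li>The aggregation that <code>mapcidr -a</code> performs on the downloaded ASN lists above can be approximated with Python&rsquo;s standard <code>ipaddress</code> module; this is a sketch with a hypothetical helper name, and its output is not guaranteed to match mapcidr line for line:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import ipaddress


def aggregate_networks(lines):
    """Collapse CIDR strings into the smallest equivalent list per IP version."""
    v4, v6 = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between concatenated files
        net = ipaddress.ip_network(line, strict=False)
        # collapse_addresses refuses mixed versions, so keep them apart
        (v4 if net.version == 4 else v6).append(net)
    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in collapsed]
</code></pre></div>
<ul>
<li>Duplicate and adjacent prefixes are merged, which is why the aggregated list is so much shorter than the raw one</li>
</ul>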
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
<li><a href="/cgspace-notes/2022-12/">December, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>