Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions


@@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -130,7 +130,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<ul>
<li>Delete 58 blank metadata values from the CGSpace database:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 58
</code></pre><ul>
<li>I also ran it on DSpace Test because we&rsquo;ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate</li>
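A quick way to sanity-check a cleanup like this before running the DELETE shown above is to count the matching rows first; the count should match what the DELETE then reports (58 here). A minimal sketch against the same metadatavalue table:

<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and text_value='';
</code></pre>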
@@ -145,7 +145,7 @@ DELETE 58
<li>There will need to be some metadata updates (though if I recall correctly it is only about seven records) for that as well, I had made some notes about it in <a href="/cgspace-notes/2017-07">2017-07</a>, but I&rsquo;ve asked for more clarification from Lili just in case</li>
<li>Looking at the DSpace logs to see if we&rsquo;ve had a change in the &ldquo;Cannot get a connection&rdquo; errors since last month when we adjusted the <code>db.maxconnections</code> parameter on CGSpace:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-09-*
<pre tabindex="0"><code># grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
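To turn those per-day figures into a single total, the same grep can be piped through awk, since grep -c prints file:count when it is given multiple files (a small sketch, not a command from the original notes):

<pre tabindex="0"><code># grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-* | awk -F: '{sum += $NF} END {print sum}'
</code></pre>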
@@ -174,7 +174,7 @@ dspace.log.2017-09-10:0
<li>The import process takes the same amount of time with and without the caching</li>
<li>Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):</li>
</ul>
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and &#39;tcp[32:4] = 0x47455420&#39;
</code></pre><ul>
<li>Great TCP dump guide here: <a href="https://danielmiessler.com/study/tcpdump">https://danielmiessler.com/study/tcpdump</a></li>
<li>The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation</li>
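The magic number in that filter is just ASCII: 0x47 0x45 0x54 0x20 are the four bytes of "GET " (including the trailing space), and tcp[32:4] compares them against the first payload bytes when the TCP header carries the usual 12 bytes of options. A quick way to confirm the encoding (assuming a shell whose printf understands \x escapes):

<pre tabindex="0"><code>$ printf '\x47\x45\x54\x20\n'
GET
</code></pre>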
@@ -204,7 +204,7 @@ dspace.log.2017-09-10:0
<li>I wonder what was going on, and looking into the nginx logs I think maybe it&rsquo;s OAI&hellip;</li>
<li>Here is yesterday&rsquo;s top ten IP addresses making requests to <code>/oai</code>:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
1 66.249.66.90
1 66.249.66.92
@@ -217,7 +217,7 @@ dspace.log.2017-09-10:0
</code></pre><ul>
<li>Compared to the previous day&rsquo;s logs it looks VERY high:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
1 66.249.66.93
2 66.249.66.91
@@ -234,9 +234,9 @@ dspace.log.2017-09-10:0
</li>
<li>And this user agent has never been seen before today (or at least recently!):</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;API scraper&quot; /var/log/nginx/oai.log
<pre tabindex="0"><code># grep -c &#34;API scraper&#34; /var/log/nginx/oai.log
62088
# zgrep -c &quot;API scraper&quot; /var/log/nginx/oai.log.*.gz
# zgrep -c &#34;API scraper&#34; /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
@@ -270,7 +270,7 @@ dspace.log.2017-09-10:0
<li>Some of these heavy users are also using XMLUI, and their user agent isn&rsquo;t matched by the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158">Tomcat Session Crawler valve</a>, so each request uses a different session</li>
<li>Yesterday alone the IP addresses using the <code>API scraper</code> user agent were responsible for 16,000 sessions in XMLUI:</li>
</ul>
<pre tabindex="0"><code># grep -a -E &quot;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&quot; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -a -E &#34;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&#34; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
15924
</code></pre><ul>
<li>If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex</li>
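For reference, the valve in question is Tomcat's Crawler Session Manager Valve, and adding the scraper means extending its crawlerUserAgents regex so that all requests with that user agent are folded into a single session. An illustrative snippet only; the real pattern lives in the linked server-tomcat7.xml.j2 template and is not reproduced here:

<pre tabindex="0"><code><!-- appending .*API scraper.* makes requests with that user agent share one session -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*API scraper.*" />
</code></pre>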
@@ -282,7 +282,7 @@ dspace.log.2017-09-10:0
<li>Looking at the spreadsheet with deletions and corrections that CCAFS sent last week</li>
<li>It appears they want to delete a lot of metadata, which I&rsquo;m not sure they realize the implications of:</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;) group by text_value;
text_value | count
--------------------------+-------
FP4_ClimateModels | 6
@@ -309,18 +309,18 @@ dspace.log.2017-09-10:0
<li>I sent CCAFS people an email to ask if they really want to remove these 200+ tags</li>
<li>She responded yes, so I&rsquo;ll at least need to do these deletes in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
DELETE 207
</code></pre><ul>
<li>When we discussed this in late July there were some other renames they had requested, but I don&rsquo;t see them in the current spreadsheet so I will have to follow that up</li>
<li>I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, since their spreadsheet evolved organically rather than systematically!</li>
<li>The final list of corrections and deletes should therefore be:</li>
</ul>
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
</code></pre><ul>
<li>Create and merge pull request to shut up the Ehcache update check (<a href="https://github.com/ilri/DSpace/pull/337">#337</a>)</li>
<li>Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): <a href="https://jira.duraspace.org/browse/DS-1492">https://jira.duraspace.org/browse/DS-1492</a></li>
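For what it&rsquo;s worth, Ehcache 2.x also respects a JVM system property for this, so the phone-home check can be disabled without touching ehcache.xml (a sketch of that approach; whether #337 uses the property or the XML attribute is not shown here):

<pre tabindex="0"><code>$ export CATALINA_OPTS="$CATALINA_OPTS -Dnet.sf.ehcache.skipUpdateCheck=true"
</code></pre>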
@@ -332,7 +332,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
<li>Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database</li>
<li>Here are all my distinct authority combinations in the database before:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@@ -347,7 +347,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>And then after adding a new item and selecting an existing &ldquo;Orth, Alan&rdquo; with an ORCID in the author lookup:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@@ -363,7 +363,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>It created a new authority&hellip; let&rsquo;s try to add another item and select the same existing author and see what happens in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@@ -379,7 +379,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>No new one&hellip; so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@@ -396,7 +396,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>Shit, it created another authority! Let&rsquo;s try it again!</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@@ -439,19 +439,19 @@ DELETE 207
<li>We still need to do the changes to <code>config.dct</code> and regenerate the <code>sitebndl.zip</code> to send to the Handle.net admins</li>
<li>According to this <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">dspace-tech mailing list entry from 2011</a>, we need to add the extra handle prefixes to <code>config.dct</code> like this:</li>
</ul>
<pre tabindex="0"><code>&quot;server_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
<pre tabindex="0"><code>&#34;server_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;replication_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;replication_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;backup_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;backup_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
</code></pre><ul>
<li>More work on the CGIAR Library migration test run locally, as I was having problems with importing the last fourteen items from the CGIAR System Management Office community</li>
@@ -494,7 +494,7 @@ DELETE 207
<li>Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite</li>
<li>Force thumbnail regeneration for the CGIAR System Organization&rsquo;s Historic Archive community (2000 items):</li>
</ul>
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>I&rsquo;m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org</li>
</ul>
@@ -552,7 +552,7 @@ DELETE 207
<li>Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org</li>
<li>Peter wants me to clean up the text values for Delia Grace&rsquo;s metadata, as the authorities are all messed up again since we cleaned them up in <a href="/cgspace-notes/2016-12">2016-12</a>:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
text_value | authority | confidence
--------------+--------------------------------------+------------
Grace, Delia | | 600
@@ -563,12 +563,12 @@ DELETE 207
<li>Strangely, none of her authority entries have ORCIDs anymore&hellip;</li>
<li>I&rsquo;ll just fix the text values and forget about it for now:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 610
</code></pre><ul>
<li>After this we have to reindex the Discovery and Authority cores (as <code>tomcat7</code> user):</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 83m56.895s
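The Authority core mentioned above has its own launcher command in DSpace 5, which can be run with the same scheduling wrappers once the Discovery reindex finishes (a sketch; the command name assumes the SOLR authority cache is enabled, and timing will vary):

<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-authority
</code></pre>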