<li>Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column</li>
<li>I joined it with the older file (stripped down to just the <code>cg.number</code> and <code>filename</code> columns) and then did the same cleanups I had done last week</li>
<li>I noticed there are six PDFs unused, so I asked Jose</li>
</ul>
</li>
<li>Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
<ul>
<li>First, according to my notes in 2020-10, a user must be a <em>collection admin</em> in order to submit via the REST API</li>
<li>Second, a collection must have an “Accept/Reject/Edit Metadata” step defined in the workflow</li>
<li>Also, I referenced my notes from this gist I had made for exactly this purpose! <a href="https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a">https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a</a></li>
</ul>
</li>
</ul>
<h2 id="2022-08-03">2022-08-03</h2>
<ul>
<li>I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
<ul>
<li>I copied the abstract column to two new fields: <code>countrytest</code> and <code>agrovoctest</code> and then used this Jython code as a transform to drop terms that don’t match (using CGSpace’s country list and list of 1,400 AGROVOC terms):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">r</span><span style="color:#e6db74">"/tmp/cgspace-countries.txt"</span>, <span style="color:#e6db74">'r'</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    countries <span style="color:#f92672">=</span> [name<span style="color:#f92672">.</span>rstrip()<span style="color:#f92672">.</span>lower() <span style="color:#66d9ef">for</span> name <span style="color:#f92672">in</span> f]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> <span style="color:#e6db74">"||"</span><span style="color:#f92672">.</span>join([x <span style="color:#66d9ef">for</span> x <span style="color:#f92672">in</span> value<span style="color:#f92672">.</span>split(<span style="color:#e6db74">' '</span>) <span style="color:#66d9ef">if</span> x<span style="color:#f92672">.</span>lower() <span style="color:#f92672">in</span> countries])
</span></span></code></pre></div><ul>
<li>Then I joined them with the other country and AGROVOC columns
<ul>
<li>I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms but it was timing out every dozen or so requests</li>
<li>Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC’s RDF, but I couldn’t figure it out</li>
<li>I just realized this will not match countries with spaces in our cell value, ugh… and Jython has weird syntax and errors and I can’t get normal Python code to work here, I’m missing something</li>
</ul>
</li>
<li>Then I extracted the titles, dates, and types and added IDs, then ran them through <code>check-duplicates.py</code> to find the existing items on CGSpace so I can add them as <code>dcterms.relation</code> links</li>
</ul>
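<p>The limitation noted above (splitting the abstract on spaces can never match a multi-word country) could be avoided by searching for each term as a whole phrase instead. This is a minimal sketch in plain Python, not the Jython transform I actually used, and I have not tested it inside OpenRefine; the function name <code>find_terms</code>, the sample country list, and the sample abstract are all hypothetical:</p>

```python
import re


def find_terms(text, terms):
    """Return the terms that occur in text as whole words or phrases (case-insensitive)."""
    found = []
    for term in terms:
        # The lookarounds stop "Niger" from matching inside "Nigeria"
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical sample data, standing in for the real country list and abstract
countries = ["Kenya", "Costa Rica", "Niger", "Nigeria"]
abstract = "Climate-smart villages in Costa Rica and Nigeria."

print("||".join(find_terms(abstract, countries)))
```

<p>Looping over the term list rather than over the words of the abstract is what lets multi-word names match, at the cost of one search per term for every cell.</p>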
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed <span style="color:#e6db74">'1s/line_number/id/'</span> > /tmp/innovations-temp.csv
</span></span></code></pre></div><ul>
<li>There were about 115 with existing items on CGSpace</li>
<li>Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):</li>
<li>I discussed issues with the DSpace 7 submission forms on Slack and Mark Wood found that the migration tool creates a non-working submission form
<ul>
<li>After updating the class name of the collection step and removing the “complete” and “sample” steps the submission form was working</li>
<li>Now the issue is that the controlled vocabularies show up like this:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2022/08/dspace7-submission.png" alt="Controlled vocabulary bug in DSpace 7"></p>
<ul>
<li>I think we need to add IDs, I will have to check what the implications of that are</li>
<li>Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent: <code>AICCRA website harvester</code>
<ul>
<li>I verified that I see it in the REST API logs, but I don’t see any new stats hits for it</li>
<li>I do see 11,000 hits from that IP last month when I had the incorrect nginx configuration that was sending a literal <code>$http_user_agent</code> so I purged those</li>
<li>It is lucky that we have <code>harvest</code> in the DSpace spider agent example file, because it means Solr doesn’t log these hits and nothing needed to be done in nginx</li>
</ul>
</li>
<li>I noticed there was high load on CGSpace, around 9 or 10
<ul>
<li>Looking at the Munin graphs it seems to just be the last two hours or so, with a slight increase in PostgreSQL connections, firewall traffic, and a more noticeable increase in CPU</li>
<li>DSpace sessions are normal</li>
<li>The number of unique hosts making requests to nginx is pretty low, though it’s only 6AM in the server’s time</li>
</ul>
</li>
<li>I see one IP in Sweden making a lot of requests with a normal user agent: 80.248.237.167
<ul>
<li>This host is on Internet Vikings (INTERNETBOLAGET), and I see 140,000 requests from them in Solr</li>
<li>I see reports of excessive scraping on AbuseIPDB.com</li>
<li>I’m gonna add their 80.248.224.0/20 to the bot-networks.conf in nginx</li>
<li>I will also purge all the hits from this IP in Solr statistics</li>
<li>I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat’s Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions</li>
</ul>
</li>
<li>Peter and Jose sent more feedback about the CRP Innovation records from MARLO
<ul>
<li>We expanded the CRP names in the citation and removed the <code>cg.identifier.url</code> URLs because they are ugly and will stop working eventually</li>
<li>The mappings of MARLO links will be done internally with the <code>cg.number</code> IDs like “IN-1119” and the Handle URIs</li>
</ul>
</li>
</ul>
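<p>Before adding a whole network like 80.248.224.0/20 to the nginx bot list, it is worth confirming that the offending IP really falls inside it. This is a minimal sketch using Python’s standard <code>ipaddress</code> module; the helper name <code>is_in_network</code> is hypothetical, and the IP and network are the ones from the notes above:</p>

```python
import ipaddress

# Network from the notes above; the helper name below is hypothetical
BOT_NETWORK = ipaddress.ip_network("80.248.224.0/20")


def is_in_network(ip, network=BOT_NETWORK):
    """Return True if the IP address string falls inside the given network."""
    return ipaddress.ip_address(ip) in network


print(is_in_network("80.248.237.167"))
```

<p>A /20 here spans 80.248.224.0 through 80.248.239.255, so the check also makes it clear how many neighbouring addresses get tagged along with the one scraper.</p>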
<h2 id="2022-08-18">2022-08-18</h2>
<ul>
<li>I talked to Jose about the CCAFS MARLO records
<ul>
<li>He still hasn’t finished re-processing the PDFs to update the internal MARLO links</li>
<li>I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose</li>
<li>On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn’t use the correct quoting when importing)</li>
</ul>
</li>
<li>Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
<ul>
<li>Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:</li>