I learned how to use the Levenshtein functions in PostgreSQL
The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
I learned how to use the Levenshtein functions in PostgreSQL
The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
<li>I learned how to use the Levenshtein functions in PostgreSQL
<ul>
<li>The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing</li>
<li>Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first</li>
</ul>
</li>
</ul>
<ul>
<li>A working query checking for duplicates in the recent AfricaRice items is:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
</span></span></span><spanstyle="display:flex;"><span><spanstyle="color:#960050;background-color:#1e0010"></span>Time: 399.751 ms
</span></span></code></pre></div><ul>
<li>There is a great <ahref="https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql">blog post discussing Soundex with Levenshtein</a> and creating indexes to make them faster</li>
<li>I want to do some proper checks of accuracy and speed against my trigram method</li>
<li>This seems low, so it must have been from the request patterns by certain visitors
<ul>
<li>64.39.98.251 is Qualys, and I’m debating blocking <ahref="https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm">all their IPs</a> using a geo block in nginx (need to test)</li>
<li>The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agentand scraping Discover</li>
<li>64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo</li>
</ul>
</li>
<li>I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)</li>
<li>I implemented a geo mapping for the user agent mapping AND the nginx <code>limit_req_zone</code> by extracting the networks into an external file and including it in two different geo mapping blocks
<ul>
<li>This is clever and relies on the fact that we can use defaults in both cases</li>
<li>First, we map the user agent of requests from these networks to “bot” so that Tomcat and Solr handle them accordingly</li>
<li>Second, we use this as a key in a <code>limit_req_zone</code>, which relies on a default mapping of ’’ (and nginx doesn’t evaluate empty cache keys)</li>
</ul>
</li>
<li>I noticed that CIP uploaded a number of Georgian presentations with <code>dcterms.language</code> set to English and Other so I changed them to “ka”
<ul>
<li>Perhaps we need to update our list of languages to include all instead of the most common ones</li>
<li>I wrote a script <code>ilri/iso-639-value-pairs.py</code> to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to <code>input-forms.xml</code></li>
<li>CGSpace went down and up a few times due to high load
<ul>
<li>I found one host in Romania making very high speed requests with a normal user agent (<code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C</code>):</li>
<li>I added 146.19.75.141 to the list of bot networks in nginx</li>
<li>While looking at the logs I started thinking about Bing again
<ul>
<li>They apparently <ahref="https://www.bing.com/toolbox/bingbot.json">publish a list of all their networks</a></li>
<li>I wrote a script to use <code>prips</code> to <ahref="https://stackoverflow.com/a/52501093/1996540">print the IPs for each network</a></li>
<li>The script is <code>bing-networks-to-ips.sh</code></li>
<li>From Bing’s IPs alone I purged 145,403 hits… sheesh</li>
</ul>
</li>
<li>Delete two items on CGSpace for Margarita because she was getting the “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-…” error
<ul>
<li>This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05</li>
</ul>
</li>
<li>Update some <code>cg.audience</code> metadata to use “Academics” instead of “Academicians”:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
<li>Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
<ul>
<li>I used the <ahref="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper functions</a> to find the collections where each term was used:</li>
</ul>
</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
<li>For now I only did terms from my list that had 100 or more occurrences in CGSpace
<ul>
<li>This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud for evaluating possible inclusion to AGROVOC</li>
</ul>
</li>
<li>Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
<ul>
<li>We want to remove them from the submission form to create space for new fields</li>
</ul>
</li>
<li>Update one term I noticed people using that was close to AGROVOC:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
<li>I modified my <code>check-duplicates.py</code> script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: <ahref="https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents">https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents</a>)
<ul>
<li>I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace</li>
<li>I am curious to see how the similarity scores compare to those from trgm… perhaps we don’t need them actually</li>
</ul>
</li>
<li>Deploy latest changes to submission form, Discovery, and browse on CGSpace
<ul>
<li>Also run all system updates and reboot the host</li>
</ul>
</li>
<li>Fix 152 <code>dcterms.relation</code> that are using “cgspace.cgiar.org” links instead of handles:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
</span></span></code></pre></div><p>Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:</p>
<li>Update an incorrect ORCID identifier for Alliance</li>
<li>Adjust collection permissions on CIFOR publications collection so Vika can submit without approval</li>
</ul>
<h2id="2022-07-14">2022-07-14</h2>
<ul>
<li>Someone on the DSpace Slack mentioned having issues with the database configuration in DSpace 7.3
<ul>
<li>The reason is apparently that the default <code>db.dialect</code> changed from “org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect” to “org.hibernate.dialect.PostgreSQL94Dialect” as a result of a Hibernate update</li>
</ul>
</li>
<li>Then I was getting more errors starting the backend server in Tomcat, but the issue was that the backend server needs Solr to be up first!</li>
<li>I see some Hetzner IPs in the top users today, but most of the requests are getting HTTP 503 because of the changes I made last week</li>
<li>I see 137.184.159.211, which is on Digital Ocean, and the DNS is apparently iitawpsite.iita.org
<ul>
<li>I’ve seen their user agent before, but I don’t think I knew it was IITA: “GuzzleHttp/6.3.3 curl/7.84.0 PHP/7.4.30”</li>
<li>I already have something in nginx to mark Guzzle as a bot, but interestingly it shows up in Solr as <code>$http_user_agent</code> so there is a logic error in my nginx config</li>
</span></span></span><spanstyle="display:flex;"><span><spanstyle="color:#960050;background-color:#1e0010"></span> include /etc/nginx/bot-networks.conf;
</span></span><spanstyle="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>After some testing on DSpace Test I see that this is actually setting the default user agent to a literal <code>$http_user_agent</code></li>
<li>Reading more about nginx’s geo/map and doing some tests on DSpace Test, it appears that the <ahref="https://stackoverflow.com/questions/47011497/nginx-geo-module-wont-use-variables">geo module cannot do dynamic values</a>
<ul>
<li>So this issue with the literal <code>$http_user_agent</code> is due to the geo block I put in place earlier this month</li>
<li>I reworked the logic so that the geo block sets “bot” or and empty string when a network matches or not, and then re-use that value in a mapping that passes through the host’s user agent in case geo has set it to an empty string</li>
<li>This allows me to accomplish the original goal while still only using one bot-networks.conf file for the <code>limit_req_zone</code> and the user agent mapping that we pass to Tomcat</li>
<li>Unfortunately this means I will have hundreds of thousands of requests in Solr with a literal <code>$http_user_agent</code></li>
<li>I might try to purge some by enumerating all the networks in my block file and running them through <code>check-spider-ip-hits.sh</code></li>
<li><code>g!</code>: global, lines <em>not</em> matching (the opposite of <code>g</code>)</li>
<li><code>/\/\d\+$/</code>, pattern matching <code>/</code> with one or more digits at the end of the line</li>
<li><code>s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/</code>, for lines not matching above, capture the IPv4 address and add <code>/32</code> at the end</li>
</ul>
</li>
<li>Then I ran the list through prips to enumerate the IPs:</li>
<li>Review the AfricaRice records from earlier this month again
<ul>
<li>I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two</li>
</ul>
</li>
<li>I took all the ~560 IPs that had hits so far in <code>check-spider-ip-hits.sh</code> above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
<ul>
<li>This purged 199,032 hits from Solr, very many of which were from Qualys, but also that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago which I blocked in nginx, but never purged the hits from</li>
<li>Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script</li>
</ul>
</li>
</ul>
<h2id="2022-07-20">2022-07-20</h2>
<ul>
<li>Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
<ul>
<li>Once it worked well I did an import to CGSpace:</li>
</ul>
</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
</span></span></code></pre></div><ul>
<li>Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up</li>
<li>Extract another ~1,600 IPs that had hits since I started the second round of <code>check-spider-ip-hits.sh</code> yesterday and purge another 303,594 hits
<ul>
<li>This is about 999846 into the original list of 1946968 from yesterday</li>
<li>A metric fuck ton of the IPs in this batch were from Hetzner</li>
</ul>
</li>
</ul>
<h2id="2022-07-21">2022-07-21</h2>
<ul>
<li>Extract another ~2,100 IPs that had hits since I started the third round of <code>check-spider-ip-hits.sh</code> last night and purge another 763,843 hits
<ul>
<li>This is about 1441221 into the original list of 1946968 from two days ago</li>
<li>Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)</li>