<li>I noticed the licenses were missing from Nilam’s original file so I found a way to check <a href="https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats">Dataverse’s API for a persistent identifier</a></li>
<li>We have both Handles and DOIs for these datasets, both from Harvard’s Dataverse</li>
</ul>
</li>
</ul>
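<p>A minimal Python sketch of building a Dataverse metadata export URL from a DOI, in the spirit of the lookup described above. The function name, the prefix list, the <code>dataverse_json</code> exporter choice, and the sample DOI are assumptions for illustration, not what was actually run:</p>

```python
from urllib.parse import urlencode

# Assumption: the datasets are hosted on Harvard's Dataverse, as the notes say.
DATAVERSE_BASE = "https://dataverse.harvard.edu"

# Hypothetical list of prefixes that may wrap a bare DOI in the source file.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def dataverse_export_url(doi: str, exporter: str = "dataverse_json") -> str:
    """Build a metadata export URL for a dataset identified by its DOI."""
    bare = doi.strip()
    for prefix in DOI_PREFIXES:
        if bare.lower().startswith(prefix):
            bare = bare[len(prefix):]
            break
    # Uppercase the DOI, as the notes do for Dataverse.
    persistent_id = "doi:" + bare.upper()
    query = urlencode(
        {"exporter": exporter, "persistentId": persistent_id},
        safe=":/",  # keep the identifier readable in the URL
    )
    return f"{DATAVERSE_BASE}/api/datasets/export?{query}"
```

<p>The exported metadata would then need to be inspected for the license field; that step is left out here because it depends on the exporter format chosen.</p>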
<ul>
<li>I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):</li>
<li>Spent some time checking the remaining 3312 items in the IFPRI 2016–2019 migration set for duplicates on CGSpace
<ul>
<li>There seem to be about 50 exact matches of title, type, and issue date</li>
</ul>
</li>
</ul>
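<p>The duplicate check above can be sketched roughly as follows in Python. This is only an illustration: the field names <code>title</code>, <code>type</code> and <code>issue_date</code> are hypothetical, and matching here ignores case and surrounding whitespace, which may be slightly looser than the check that was actually done:</p>

```python
from collections import defaultdict


def match_key(item):
    """Key used for matching: title, type, and issue date."""
    return (
        item["title"].strip().casefold(),
        item["type"].strip().casefold(),
        item["issue_date"].strip(),
    )


def find_exact_duplicates(candidates, existing):
    """Return (candidate, matches) pairs for candidates already in the repository."""
    index = defaultdict(list)
    for item in existing:
        index[match_key(item)].append(item)

    duplicates = []
    for candidate in candidates:
        key = match_key(candidate)
        if key in index:
            duplicates.append((candidate, index[key]))
    return duplicates
```

<p>Anything this flags would still need manual review before merging metadata, since two different items can share a title, type, and date.</p>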
<h2 id="2024-06-20">2024-06-20</h2>
<ul>
<li>Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set</li>
<li>Heavy load on both CGSpace and DSpace 7 Test this afternoon
<ul>
<li>Took me a while to figure out it was due to someone / something hammering <code>/search</code> for a bunch of facets</li>
<li>The <code>pm2 logs</code> command was more useful than the nginx logs to see the requests at least, for example:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
</code></pre><ul>
<li>Still difficult to find the client, because the logs are all <a href="https://github.com/DSpace/dspace-angular/issues/2902">coming from Angular’s user agent</a> and IP
<ul>
<li>I changed the nginx logging to use the <code>X-Forwarded-For</code> header, because the default <code>combined</code> log format uses <code>$remote_addr</code>, which is only accurate when the request goes directly to the API rather than through Angular</li>
<li>From what I can see now the IPs are all coming from Huawei Cloud and Tencent</li>
<li>The ASNs are AS136907 (Huawei) and AS132203 (Tencent)</li>
<li>For now I will just add those to the list of bot networks</li>
</ul>
</li>
</ul>
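<p>To get a quick picture of which facets are being hammered, the <code>pm2 logs</code> output shown above could be summarized with something like this Python sketch. It assumes the line layout seen in the excerpt (a <code>GET</code> followed by the request path) and is not a tool that was actually used here:</p>

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Assumption: each request line contains "GET <path-with-query>" as in the excerpt.
REQUEST_RE = re.compile(r"GET (\S+)")


def facet_counts(lines):
    """Count how often each facet filter (f.<name>) appears in /search requests."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if url.path != "/search":
            continue
        for name, _value in parse_qsl(url.query):
            if name.startswith("f."):
                counts[name[2:]] += 1
    return counts
```

<p>Feeding it the saved log lines and calling <code>most_common()</code> on the result would show the facets that appear most often across requests.</p>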
<h2 id="2024-06-21">2024-06-21</h2>
<ul>
<li>Update the nginx logging to use <a href="http://nginx.org/en/docs/http/ngx_http_realip_module.html">nginx’s <code>real_ip</code> module</a> to log the correct client IP
<ul>
<li>I think this means we will start sending ‘bot’ to the Angular / Express frontend because bot IPs will be properly classified now…</li>
<li>I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in <a href="https://github.com/DSpace/dspace-angular/issues/2902">https://github.com/DSpace/dspace-angular/issues/2902</a> is to pass on the client’s user-agent</li>
</ul>
</li>
<li>Minor updates to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> and <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a> to normalize a few more invalid DOI formats</li>
</ul>
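<p>As an illustration of the kind of DOI normalization mentioned above, here is a minimal Python sketch. The regular expression and the choice of <code>https://doi.org/</code> as the canonical prefix are assumptions for this example and are not the actual rules in csv-metadata-quality or cgspace-java-helpers:</p>

```python
import re

# Assumption: a DOI is "10.", a 4-9 digit registrant code, a slash, then a suffix.
DOI_RE = re.compile(r"(10\.\d{4,9}/\S+)")


def normalize_doi(value):
    """Return the DOI as an https://doi.org/ URL, or None if no DOI is found."""
    match = DOI_RE.search(value.strip())
    if match is None:
        return None
    return "https://doi.org/" + match.group(1)
```

<p>Variants such as <code>doi: 10.1234/abc</code> or <code>http://dx.doi.org/10.1234/abc</code> would both come out as <code>https://doi.org/10.1234/abc</code> with this approach.</p>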
<h2 id="2024-06-25">2024-06-25</h2>
<ul>
<li>Work on uploading some missing PDFs from the IFPRI 2016–2019 batch migration</li>
</ul>
<h2 id="2024-06-26">2024-06-26</h2>
<ul>
<li>Did a big cleanup of several thousand journal articles based on metadata from Crossref</li>