Add notes for 2023-01-31

2023-01-31 22:20:38 +03:00
parent 81f04f48ad
commit 16ba5723eb
29 changed files with 151 additions and 34 deletions

@@ -550,5 +550,60 @@ $ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
- The above was done with just 5,000 DOIs because it was taking a long time, but after the last step I imported into OpenRefine to clean up the license URLs
- Then I imported 635 new licenses to CGSpace woooo
- After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
- Peter finished the corrections on affiliations, authors, and donors
- I quickly checked them and applied each on CGSpace
- Start a harvest on AReS
## 2023-01-30
- Run the thumbnail fixer tasks on the Initiatives collections:
```console
$ chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails 10568/115087 | tee -a /tmp/FixLowQualityThumbnails.log
$ grep -c remove /tmp/FixLowQualityThumbnails.log
16
$ chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixJpgJpgThumbnails 10568/115087 | tee -a /tmp/FixJpgJpgThumbnails.log
$ grep -c replacing /tmp/FixJpgJpgThumbnails.log
13
```
## 2023-01-31
- Someone from the Google Scholar team contacted us to ask why Googlebot is blocked from crawling CGSpace
- I said that I blocked them because they crawl haphazardly and we had high load during PRMS reporting
- Now I will unblock their ASN (AS15169) in nginx...
- I urged them to be smarter about crawling since we're a small team and they are a huge engineering company
- I removed their ASN and regenerated my list from 2023-01-17:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS714 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS32934 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS14618
$ cat AS* | sort | uniq | wc -l
17134
$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
```
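For reference, the deduplicate-and-aggregate step can be approximated with Python's standard `ipaddress` module. This is a minimal sketch and not a replacement for `mapcidr`: it assumes each downloaded file contains one CIDR per line, and the function name and sample networks (documentation ranges) are hypothetical.

```python
import ipaddress


def aggregate_networks(lines):
    """Deduplicate CIDR strings and collapse them into the fewest covering networks.

    Assumption: one CIDR per line; blank lines are skipped.
    """
    v4, v6 = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        net = ipaddress.ip_network(line, strict=False)
        # collapse_addresses() cannot mix IPv4 and IPv6, so keep them apart
        (v4 if net.version == 4 else v6).add(net)
    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in collapsed]


# Hypothetical sample input using documentation ranges, with one duplicate
sample = [
    "192.0.2.0/25",
    "192.0.2.128/25",
    "192.0.2.0/25",
    "198.51.100.0/24",
    "2001:db8::/33",
    "2001:db8:8000::/33",
]
print(aggregate_networks(sample))
```

Adjacent halves are merged into their parent network, so the six sample lines collapse to three networks, which is the same kind of reduction that keeps the nginx block list short.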
- Then I updated nginx...
- Re-run the scripts that I originally used in 2022-11 to delete duplicate metadata values and update item timestamps
- This was about 650 duplicate metadata values...
- Exported CGSpace to do some metadata interrogation in OpenRefine
- I looked at items that are set as `Limited Access` but have Creative Commons licenses
- I filtered ~150 that had DOIs and checked them on the Crossref API using `crossref-doi-lookup.py`
- Of those, only about five were incorrectly marked as having Creative Commons licenses, so I set those to copyrighted
- For the rest, I set them to Open Access
- Start a harvest on AReS
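
The `Limited Access` check above was done in OpenRefine, but the same filter could be sketched in Python against a CSV export. This is only an illustration: the column names (`dcterms.accessRights`, `dcterms.license`, `cg.identifier.doi`), the assumption that Creative Commons license values start with `CC`, and the sample rows are all assumptions, not taken from the actual export.

```python
import csv
import io


def limited_access_with_cc(
    rows,
    access_col="dcterms.accessRights",  # assumed column name
    license_col="dcterms.license",      # assumed column name
    doi_col="cg.identifier.doi",        # assumed column name
):
    """Return rows marked Limited Access that carry a CC license and a DOI."""
    hits = []
    for row in rows:
        access = (row.get(access_col) or "").strip()
        license_value = (row.get(license_col) or "").strip()
        doi = (row.get(doi_col) or "").strip()
        # Assumption: Creative Commons values look like "CC-BY-4.0"
        is_cc = license_value.upper().startswith("CC")
        if access == "Limited Access" and is_cc and doi:
            hits.append(row)
    return hits


# Hypothetical sample export with made-up identifiers and DOIs
sample_csv = """id,dcterms.accessRights,dcterms.license,cg.identifier.doi
a,Limited Access,CC-BY-4.0,https://doi.org/10.1234/example.1
b,Open Access,CC-BY-4.0,https://doi.org/10.1234/example.2
c,Limited Access,Copyrighted; all rights reserved,https://doi.org/10.1234/example.3
d,Limited Access,CC-BY-NC-4.0,
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
candidates = limited_access_with_cc(rows)
print([row["id"] for row in candidates])
```

Only the rows that survive this filter would need to be checked against Crossref; rows without a DOI are left out because there is nothing to look up.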
<!-- vim: set sw=2 ts=2: -->