---
title: "May, 2022"
date: 2022-05-04T09:13:39+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-05-04
- I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
  - 18.207.136.176
  - 185.189.36.248
  - 50.118.223.78
  - 52.70.76.123
  - 3.236.10.11
- Looking at the Solr statistics for 2022-04:
  - 52.191.137.59 is owned by Microsoft, but is using a normal user agent and making tens of thousands of requests
  - 64.39.98.62 is owned by Qualys, and all their requests are probing for `/etc/passwd` and similar paths
  - 185.192.69.15 is in the Netherlands and is using a normal user agent, but is making excessive automated HTTP requests to paths forbidden in robots.txt
  - 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don't know why its requests were logged in Solr
  - 52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
  - 157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
  - 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don't know why its requests were logged in Solr
- If I query Solr for `time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.` I see a handful of IPs that made 41,000 requests
- I purged 93,974 hits from these IPs using my `check-spider-ip-hits.sh` script
<!--more-->
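
The Solr query above can also be turned into a request that facets the matching hits by IP. This is only a minimal sketch: the base URL, the core name `statistics`, and the `ip` facet field are assumptions about the setup, not something recorded in these notes. Only standard Solr query parameters are used.

```python
from urllib.parse import urlencode

# Hypothetical base URL: the Solr host, port, and core name are assumptions
SOLR_SELECT_URL = "http://localhost:8983/solr/statistics/select"


def build_ip_facet_url(query: str, limit: int = 10) -> str:
    """Build a Solr select URL that facets the hits matching `query` by IP."""
    params = {
        "q": query,
        "rows": 0,  # only the facet counts are wanted, not the documents
        "facet": "true",
        "facet.field": "ip",  # assumed name of the field holding the client IP
        "facet.limit": limit,
        "facet.mincount": 1,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# The query string is the one from the notes above
url = build_ip_facet_url("time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.")
print(url)
```

Setting `rows=0` keeps the response small, since only the per-IP counts matter here.
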
- Now looking at the Solr statistics by user agent I see:
  - `SomeRandomText`
  - `RestSharp/106.11.7.0`
  - `MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)`
  - `wp_is_mobile`
  - `Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"`
  - `insomnia/2022.2.1`
  - `ZoteroTranslationServer`
  - `omgili/0.5 +http://omgili.com`
  - `curb`
  - `Sprout Social (Link Attachment)`
- I purged 2,900 hits from these user agents from Solr using my `check-spider-hits.sh` script
- I made a [pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/54) for some of these agents
- In the meantime I will add them to our local overrides in DSpace
- Run all system updates on AReS server, update all Docker containers, and restart the server
- Start a harvest on AReS
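
As a rough illustration of checking incoming user agents against a local list like the one noted earlier in this entry, here is a minimal sketch. Treating each name as a literal, case-insensitive substring is an assumption made for the example. It is not a description of how DSpace or COUNTER-Robots actually match agents, and `is_listed_agent` is a hypothetical helper.

```python
import re

# A few agent names taken from the list earlier in this entry.
# Matching them as literal, case-insensitive substrings is an assumption.
AGENT_NAMES = [
    "RestSharp",
    "MetaInspector",
    "insomnia",
    "ZoteroTranslationServer",
    "omgili",
    "Sprout Social",
]

# Escape each name so characters like spaces or dots are matched literally
COMPILED = [re.compile(re.escape(name), re.IGNORECASE) for name in AGENT_NAMES]


def is_listed_agent(user_agent: str) -> bool:
    """Return True if the user agent contains any of the listed names."""
    return any(pattern.search(user_agent) for pattern in COMPILED)


for agent in ["RestSharp/106.11.7.0", "Mozilla/5.0 (X11; Linux x86_64) Firefox/100.0"]:
    print(agent, is_listed_agent(agent))
```
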
## 2022-05-05
- Update PostgreSQL JDBC driver to 42.3.5 in the Ansible infrastructure playbooks and deploy on DSpace Test
- Peter asked me how many items we add to CGSpace every year
- I wrote a SQL query to check the number of items grouped by their accession dates since 2009:
```console
localhost/dspacetest= ☘ SELECT EXTRACT(year from text_value::date) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
yyyy │ count
──────┼───────
2022 │ 2073
2021 │ 6471
2020 │ 4074
2019 │ 7330
2018 │ 8899
2017 │ 6860
2016 │ 8451
2015 │ 15692
2014 │ 16479
2013 │ 4388
2012 │ 6472
2011 │ 2694
2010 │ 2457
2009 │ 293
```
- Note that I had an issue with casting `text_value` to date because one item had an accession date of `2016` instead of `2016-09-29T20:14:47Z`
- Once I fixed that, PostgreSQL was able to [extract() the year](https://www.postgresql.org/docs/12/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)
- Some other methods I tried also worked, for example `TO_DATE()`:
```console
localhost/dspacetest= ☘ SELECT EXTRACT(year from TO_DATE(text_value, 'YYYY-MM-DD"T"HH24:MI:SS"Z"')) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
```
- But it seems PostgreSQL is smart enough to recognize the date formatting in strings automatically when we cast, so we don't need to convert with `TO_DATE()` and an explicit format first
- Another thing I noticed is that a few hundred items have accession dates from decades ago; perhaps this is due to importing items from the CGIAR Library?
- I spent some time merging a few pull requests for DSpace 6.4 and porting one to `main` for DSpace 7.x
- I also submitted a [pull request to migrate Mirage 2's build from bower and compass to yarn and node-sass](https://github.com/DSpace/DSpace/pull/8288)
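
Returning to the accession date issue noted earlier in this entry: a value like `2016` breaks the year extraction because it is not a full timestamp. The sketch below shows the same kind of check in Python, as an analogue rather than a reproduction of PostgreSQL's casting rules. The function name `count_by_year` and the third sample value are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Format of a well-formed accession date such as 2016-09-29T20:14:47Z
ACCESSION_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def count_by_year(values):
    """Count values per year and collect any that are not full timestamps."""
    counts = Counter()
    malformed = []
    for value in values:
        try:
            counts[datetime.strptime(value, ACCESSION_FORMAT).year] += 1
        except ValueError:
            # e.g. a bare "2016" does not match the expected format
            malformed.append(value)
    return counts, malformed


# The first two values come from the notes; the third is a made-up example
sample = ["2016-09-29T20:14:47Z", "2016", "2022-05-05T08:00:00Z"]
counts, malformed = count_by_year(sample)
print(dict(counts), malformed)
```

Collecting the malformed values instead of failing on the first one makes it easier to find every item that needs fixing before running the grouped query.
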
## 2022-05-07
- Start a harvest on AReS
## 2022-05-09
- Submit an issue to Atmire's bug tracker inquiring about DSpace 6.4 support
## 2022-05-10
- Submit an updated [pull request to migrate Mirage 2's build from bower and compass to npm and node-sass](https://github.com/DSpace/DSpace/pull/8292)
- This one is better than the previous one because it uses npm directly, which comes with the Node.js distribution, rather than requiring the user to install yarn
- I also updated a bunch of grunt build deps
<!-- vim: set sw=2 ts=2: -->