mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-10-12
@ -215,4 +215,80 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
2. If the collection has a workflow the item will enter it and the API returns an item ID
3. If the collection does not have a workflow then the item is committed to the archive and you get a Handle

## 2020-10-09

- Skype with Peter about AReS and CGSpace
  - We discussed removing Atmire Listings and Reports from DSpace 6 because we can probably make the same reports in AReS and this module is the one that is currently holding us back from the upgrade
  - We discussed allowing partners to submit content via the REST API and perhaps making it an extra fee due to the burden it incurs with unfinished submissions, manual duplicate checking, developer support, etc
  - He was excited about the possibility of using my statistics API for more things on AReS as well as item view pages
- Also I fixed a bunch of the CRP mappings in the AReS value mapper and started a fresh re-indexing

## 2020-10-12

- Looking at CGSpace's Solr statistics for 2020-09 and I see:
  - `RTB website BOT`: 212916
  - `Java/1.8.0_66`: 3122
  - `Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1`: 614
  - `omgili/0.5 +http://omgili.com`: 272
  - `Mozilla/5.0 (compatible; TrendsmapResolver/0.1)`: 199
  - `Vizzit`: 160
  - `Scoop.it`: 151
- I'm confused because a pattern for `bot` has existed in the default DSpace spider agents file forever...
- I see 259,000 hits in CGSpace's 2020 Solr core when I search for this: `userAgent:/.*[Bb][Oo][Tt].*/`
  - This includes 228,000 for `RTB website BOT` and 18,000 for `ILRI Livestock Website Publications importer BOT`
- I made a few requests to DSpace Test with the RTB user agent to see if it gets logged or not:

```
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
```
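
One way to check whether test requests like these were recorded is to ask Solr for a count of documents with that exact user agent. The following is only a minimal sketch: the helper names are made up for illustration, and it assumes that `userAgent` is an exact-match string field, that the core is called `statistics`, and that Solr is reachable at a URL like the `http://localhost:8083/solr` used elsewhere in these notes.

```shell
#!/bin/sh
# Sketch only: helper names, Solr URL and core name are assumptions.

# Build a Solr phrase query that matches one exact user agent string.
# Assumes the agent itself contains no double quotes.
build_agent_query() {
    printf 'userAgent:"%s"' "$1"
}

# Ask Solr how many documents match. rows=0 returns no documents,
# only the response header and the numFound count.
count_agent_hits() {
    solr_url="$1"
    core="$2"
    agent="$3"
    curl -s -G "${solr_url}/${core}/select" \
        --data-urlencode "q=$(build_agent_query "$agent")" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json"
}

# Example (needs a reachable Solr, so it is left commented out):
# count_agent_hits http://localhost:8083/solr statistics "RTB website BOT"
```

Letting `curl` do the URL encoding with `-G --data-urlencode` avoids having to escape the spaces and quotes in the query by hand.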

- After a few minutes I saw these four hits in Solr... WTF
- So is there some issue with DSpace's parsing of the spider agent files?
- I added `RTB website BOT` to the ilri pattern file, restarted Tomcat, and made four more requests to the bitstream
- These four requests were recorded in Solr too, WTF!
- It seems like the patterns aren't working at all...
- I decided to try something drastic and removed all pattern files, adding only one single pattern `bot` to make sure this is not because of a syntax or precedence issue
- Now even those four requests were recorded in Solr, WTF!
- I will try one last thing, to put a single entry with the exact pattern `RTB website BOT` in a single spider agents pattern file...
- Nope! Still records the hits... WTF
- As a last resort I tried to use the vanilla [DSpace 6 `example` file](https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/spiders/agents/example)
- And the hits still get recorded... WTF
- So now I'm wondering if this is because of our custom Atmire shit?
- I will have to test on a vanilla DSpace instance I guess before I can complain to the dspace-tech mailing list
- I refactored the `check-spider-hits.sh` script to read patterns from a text file rather than sed's stdout, and to search for a literal space wherever a pattern uses `\s`, because Lucene's regular expression syntax doesn't support `\s` (and spaces work just fine)
  - Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html
  - Reference: https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches
- I added `[Ss]pider` to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID
- I added a few of the patterns from above to our local agents list and ran the `check-spider-hits.sh` on CGSpace:

```
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
Purging 228916 hits from RTB website BOT in statistics
Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
Purging 199 hits from [Ss]pider in statistics
Purging 2326 hits from ubermetrics in statistics
Purging 888 hits from omgili\.com in statistics
Purging 1888 hits from TrendsmapResolver in statistics
Purging 3546 hits from Vizzit in statistics
Purging 2127 hits from Scoop\.it in statistics

Total number of bot hits purged: 261258
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2019 -u http://localhost:8083/solr -p
Purging 2952 hits from TrendsmapResolver in statistics-2019
Purging 4252 hits from Vizzit in statistics-2019
Purging 2976 hits from Scoop\.it in statistics-2019

Total number of bot hits purged: 10180
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2018 -u http://localhost:8083/solr -p
Purging 1702 hits from TrendsmapResolver in statistics-2018
Purging 1062 hits from Vizzit in statistics-2018
Purging 920 hits from Scoop\.it in statistics-2018

Total number of bot hits purged: 3684
```
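
The pattern handling described for `check-spider-hits.sh` (read patterns from a text file, search for a literal space instead of `\s`) could be sketched roughly as below. This is not the actual script: the function names are hypothetical, and the anchor handling is my own assumption, based on the fact that Lucene regular expressions must match the whole field value and do not support `^` or `$` as anchors.

```shell
#!/bin/sh
# Sketch only, not the real check-spider-hits.sh. Function names are made up.

# Turn one spider agent pattern into a Solr regexp query on userAgent.
pattern_to_query() {
    # Lucene regexps have no \s shorthand, so use a literal space instead
    pattern=$(printf '%s' "$1" | sed 's/\\s/ /g')
    # Lucene regexps always match the whole value: drop a leading ^,
    # otherwise allow anything before the pattern
    case "$pattern" in
        \^*) pattern=${pattern#\^} ;;
        *)   pattern=".*${pattern}" ;;
    esac
    # Same for a trailing $ (limitation: an escaped \$ is not recognized)
    case "$pattern" in
        *\$) pattern=${pattern%\$} ;;
        *)   pattern="${pattern}.*" ;;
    esac
    printf 'userAgent:/%s/' "$pattern"
}

# Print one query per pattern in a file, skipping blank lines and comments.
queries_from_file() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        pattern_to_query "$line"
        printf '\n'
    done < "$1"
}

# Example: queries_from_file dspace/config/spiders/agents/ilri
```

Each printed query could then be sent to a core's `select` handler with `rows=0` to count matching hits before deciding whether to purge them.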
<!-- vim: set sw=2 ts=2: -->