mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2019-11-07
@ -177,5 +177,39 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
- CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
- They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
- I have now merged the changes into the `5_x-prod` branch ([#432](https://github.com/ilri/DSpace/pull/432))
- I am reconsidering the move of `cg.identifier.dataurl` to `cg.hasMetadata` in CG Core v2
- The values of this field are mostly links to data sets on Dataverse and partner sites
- I opened an [issue on GitHub](https://github.com/AgriculturalSemantics/cg-core/issues/10) to ask Marie-Angelique for clarification
- Looking into CGSpace statistics again
- I searched for hits in Solr from the BUbiNG bot and found almost 63,000 in the `statistics-2018` core:
```
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
<result name="response" numFound="62944" start="0">
```
- Similar for com.plumanalytics, Grammarly, and ltx71!
```
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*com.plumanalytics*' | xmllint --format - | grep numFound
<result name="response" numFound="28256" start="0">
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
<result name="response" numFound="6288" start="0">
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
<result name="response" numFound="105663" start="0">
```
- Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
```
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
<result name="response" numFound="0" start="0"/>
```
- I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
- For the years 2010 through 2019 there are 1.6 million hits from these spider user agents
- For 2019 alone there are 740,000, over half of which come from Unpaywall!
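
The script itself is not included in these notes, so the following is only a rough sketch of how such a check could look, not the actual script. The agent list, the `SOLR_URL` variable, all function names, and the use of `rows=0` are assumptions made for illustration. Only the `statistics-2018` core name and the `http` command come from the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: SOLR_URL, the agent list, and all function names are assumptions.

SOLR_URL="${SOLR_URL:-http://localhost:8081/solr}"

# Hypothetical agent list, limited to the user agents named in these notes
AGENTS=(BUbiNG com.plumanalytics Grammarly ltx71)

# Build the select URL for one core and one user agent pattern.
# rows=0 because only the numFound count is needed, not the documents.
build_query_url() {
    local core="$1" agent="$2"
    printf '%s/%s/select?q=userAgent:*%s*&rows=0' "$SOLR_URL" "$core" "$agent"
}

# Read a Solr XML response on stdin and print the first numFound value.
extract_num_found() {
    grep -o 'numFound="[0-9]*"' | head -n 1 | grep -o '[0-9][0-9]*'
}

# Query one core for one agent and print the number of hits.
count_agent_hits() {
    local core="$1" agent="$2"
    http --print b "$(build_query_url "$core" "$agent")" | extract_num_found
}

# Loop over the given cores and all agents, printing one line per
# core/agent pair and a grand total at the end.
check_all_agents() {
    local total=0 core agent hits
    for core in "$@"; do
        for agent in "${AGENTS[@]}"; do
            hits="$(count_agent_hits "$core" "$agent")"
            hits="${hits:-0}"
            echo "$core $agent $hits"
            total=$((total + hits))
        done
    done
    echo "total $total"
}
```

After sourcing the file, something like `check_all_agents statistics-2018 statistics` would print the counts for those two cores. Which yearly cores exist on a given server would have to be checked first.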
<!-- vim: set sw=2 ts=2: -->