cgspace-notes/2018-09.md at 8383cd466bd9e475c36e9fec05f6ddba99427b65

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-09 16:45:45 +01:00

They moved to Lyrasis in 2019.

2020-04-13 15:30:34 +03:00


2018-09-02

New PostgreSQL JDBC driver version 42.2.5
I'll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:

02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
 java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
    at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
    at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1838)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:

Full log here: https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2
XMLUI fails to load, but the REST, SOLR, JSPUI, etc work
The old 5_x-prod-dspace-5.5 branch does work in Ubuntu 18.04 with Tomcat 8.5.30-1ubuntu1.4, however!
And the 5_x-prod DSpace 5.8 branch does work in Tomcat 8.5.x on my Arch Linux laptop...
I'm not sure where the issue is then!

2018-09-03

Abenet says she's getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week
They are from the CUA module
Two of them have "no data" and one has a "null" title
The last one is a report of the top downloaded items, and includes a graph
She will try to click the "Unsubscribe" link in the first two to see if it works, otherwise we should contact Atmire
The only one she remembers subscribing to is the top downloads one

2018-09-04

I'm looking over the latest round of IITA records from Sisay: Mercy1806_August_29
- All fields are split with multiple columns like cg.authorship.types and cg.authorship.types[]
- This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)
- Five items had dc.date.issued values like 2013-5 so I corrected them to be 2013-05
- Several metadata fields had values with newlines in them (even in some titles!), which I fixed by collapsing consecutive whitespace in OpenRefine
- Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn't exist then so this is impossible
  - I got all items that were from 2011 and onwards using a custom facet with this GREL on the dc.date.issued column: isNotNull(value.match(/201[1-8].*/)) and then blanking their CRPs
- Some affiliations with only one separator (|) for multiple values
- I replaced smart quotes like ’ with plain ones
- Some inconsistencies in cg.subject.iita like COWPEA and COWPEAS, and YAM and YAMS, etc, as well as some spelling mistakes like IMPACT ASSESSMENTN
- Some values in the dc.identifier.isbn are actually ISSNs so I moved them to the dc.identifier.issn column
- I found one invalid ISSN using a custom text facet with the regex from the ISSN page on Wikipedia: isNotBlank(value.match(/^\d{4}-\d{3}[\dxX]$/))
- One invalid value for dc.type
Abenet says she hasn't received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don't need to create an issue on Atmire's bug tracker anymore
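The format checks from the IITA cleanup above can be approximated outside OpenRefine. This is a minimal sketch, not the GREL I actually ran: the function names are hypothetical, the ISSN pattern is the same one used in the text facet, and it only checks the shape of the value (no checksum validation).

```python
import re

# Same pattern as the GREL text facet: four digits, a hyphen, three digits, check character
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dxX]$")


def looks_like_issn(value):
    """Return True if the value has the shape of an ISSN (format only, no checksum)."""
    return bool(ISSN_PATTERN.match(value.strip()))


def pad_issued_month(value):
    """Turn dates like 2013-5 into 2013-05 and leave other values untouched."""
    match = re.match(r"^(\d{4})-(\d)$", value.strip())
    if match:
        return "{}-0{}".format(match.group(1), match.group(2))
    return value


def collapse_whitespace(value):
    """Replace runs of whitespace (including newlines) with a single space."""
    return " ".join(value.split())
```

Values that fail `looks_like_issn()` in the ISSN column, or pass it in the ISBN column, are the ones worth a manual look.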

2018-09-10

Playing with strest to test the DSpace REST API programmatically
For example, given this test.yaml:

version: 1

requests:
  test:
    method: GET
    url: https://dspacetest.cgiar.org/rest/test
    validate:
      raw: "REST api is running."

  login:
    url: https://dspacetest.cgiar.org/rest/login
    method: POST
    data:
      json: {"email":"test@dspace","password":"thepass"}

  status:
    url: https://dspacetest.cgiar.org/rest/status
    method: GET
    headers:
      rest-dspace-token: Value(login)

  logout:
    url: https://dspacetest.cgiar.org/rest/logout
    method: POST
    headers:
      rest-dspace-token: Value(login)

# vim: set sw=2 ts=2:

Works pretty well, though the DSpace logout always returns an HTTP 415 error for some reason
We could eventually use this to test sanity of the API for creating collections etc
A user is getting an error in her workflow:

2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step: 
org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819

Seems to be during submit step, because it's workflow step 1...?
Move some top-level CRP communities to be below the new CGIAR Research Programs and Platforms community:

$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
$ dspace community-filiator --set -p 10568/97114 -c 10568/35409
$ dspace community-filiator --set -p 10568/97114 -c 10568/3112

Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:

update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
UPDATE 23
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
UPDATE 1
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
DELETE 17
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
UPDATE 15

Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)
The current cg.identifier.status field will become "Access rights" and dc.rights will become "Usage rights"
I have some work in progress on the 5_x-rights branch
Linode said that CGSpace (linode18) had a high CPU load earlier today
When I looked, I saw it's the same Russian IP that I noticed last month:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1459 157.55.39.202
   1579 95.108.181.88
   1615 157.55.39.147
   1714 66.249.64.91
   1924 50.116.102.77
   3696 157.55.39.106
   3763 157.55.39.148
   4470 70.32.83.92
   4724 35.237.175.180
  14132 5.9.6.51

And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):

# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
14133
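One caveat with the count above: `grep -c` counts matching log lines, not distinct session IDs, so it is only an upper bound on the number of sessions. A rough sketch of counting both, assuming the log lines contain `session_id=...:ip_addr=...` exactly as in the grep pattern (the function name is hypothetical):

```python
import re


def count_sessions(lines, ip_addr):
    """Return (matching_lines, unique_sessions) for one IP address."""
    # Escape the dots in the IP and make sure 5.9.6.51 does not also match 5.9.6.510
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip_addr) + r"(?!\d)"
    )
    matching_lines = 0
    sessions = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            matching_lines += 1
            sessions.add(match.group(1))
    return matching_lines, len(sessions)
```

It could be fed an open file handle, for example `count_sessions(open("dspace.log.2018-09-10"), "5.9.6.51")`. If the second number is much smaller than the first, the bot is re-using sessions after all.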

The user agent is still the same:

Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)

I added .*crawl.* to the Tomcat Crawler Session Manager Valve, so I'm not sure why the bot is creating so many sessions...
I just tested that user agent on CGSpace and it does not create a new session:

$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)

HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 10 Sep 2018 20:43:04 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

I will have to keep an eye on it and perhaps add it to the list of "bad bots" that get rate limited

2018-09-12

Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more
Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:

$ sudo docker volume create --name dspacetest_data
$ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine

Sisay is still having problems with the controlled vocabulary for top authors
I took a look at the submission template and Firefox complains that the XML file is missing a root element
I guess it's because Firefox is receiving an empty XML file
I told Sisay to run the XML file through tidy
More testing of the access and usage rights changes

2018-09-13

Peter was communicating with Altmetric about the OAI mapping issue for item 10568/82810 again
Altmetric said it was somehow related to the OAI dateStamp not getting updated when the mappings changed, but I said that back in [2018-07]({{< relref "2018-07.md" >}}) when this happened it was because the OAI was actually just not reflecting all the item's mappings
After forcing a complete re-indexing of OAI the mappings were fine
The dateStamp is most probably only updated when the item's metadata changes, not its mappings, so if Altmetric is relying on that we're in a tricky spot
We need to make sure that our OAI isn't publicizing stale data... I was going to post something on the dspace-tech mailing list, but never did
Linode says that CGSpace (linode18) has had high CPU for the past two hours
The top IP addresses today are:

# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
     32 46.229.161.131
     38 104.198.9.108
     39 66.249.64.91
     56 157.55.39.224
     57 207.46.13.49
     58 40.77.167.120
     78 169.255.105.46
    702 54.214.112.202
   1840 50.116.102.77
   4469 70.32.83.92

And the top two addresses seem to be re-using their Tomcat sessions properly:

$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2

So I'm not sure what's going on
Valerio asked me if there's a way to get the page views and downloads from CGSpace
I said no, but that we might be able to piggyback on the Atmire statlet REST API
For example, when you expand the "statlet" at the bottom of an item like 10568/97103 you can see the following request in the browser console:

https://cgspace.cgiar.org/rest/statlets?handle=10568/97103

That JSON file has the total page views and item downloads for the item...
Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds
I had a quick look at the DSpace 5.x manual and it doesn't seem that this is possible (you can only add metadata)
Testing the new LDAP server that CGNET says will be replacing the old one, it doesn't seem that they are using the global catalog on port 3269 anymore, now only 636 is open
I did a clean deploy of DSpace 5.8 on Ubuntu 18.04 with some stripped down Tomcat 8 configuration and actually managed to get it up and running without the autowire errors that I had previously experienced
I realized that it always works on my local machine with Tomcat 8.5.x, but not when I do the deployment from Ansible in Ubuntu 18.04
So there must be something in my Tomcat 8 server.xml template
Now I re-deployed it with the normal server template and it's working, WTF?
Must have been something like an old DSpace 5.5 file in the spring folder... weird
But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc...

2018-09-14

Sisay uploaded the IITA records to CGSpace, but forgot to remove the old Handles
I explicitly told him not to forget to remove them yesterday!

2018-09-16

Add the DSpace build.properties as a template into my Ansible infrastructure scripts for configuring DSpace machines
One stupid thing there is that I add all the variables in a private vars file, which is apparently higher precedence than host vars, meaning that I can't override them (like SMTP server) on a per-host basis
Discuss access and usage rights with Peter
I suggested that we leave access rights (cg.identifier.status) as it is now, with "Open Access" or "Limited Access", and then simply re-brand that as "Access rights" in the UIs and relevant drop downs
Then we continue as planned to add dc.rights as "Usage rights"

2018-09-17

Skype meeting with CGSpace team in Addis
Change cg.identifier.status "Access rights" options to:
- Open Access→Unrestricted Access
- Limited Access→Restricted Access
- Metadata Only
Update these immediately, but talk to CodeObia to create a mapping between the old and new values
Finalize dc.rights "Usage rights" with seven combinations of Creative Commons, plus the others
Need to double check the new CRP community to see why the collection counts aren't updated after we moved the communities there last week
- I forced a full Discovery re-index and now the community shows 1,600 items
Check if it's possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL
Agree that we'll publicize AReS explorer on the week before the Big Data Platform workshop
- Put a link and/or picture on the CGSpace homepage saying "Visualized CGSpace research" or something, and post a message on Yammer
I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer
Currently CodeObia is exploring using the Atmire statlets internal API, but I don't really like that...
There are some example queries on the DSpace Solr wiki
For example, this query returns 1655 rows for item 10568/10630:

$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'

The id in the Solr query is the item's database id (get it from the REST API or something)
Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire's statlet shows, though the query logic here is confusing:

$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'

According to the SolrQuerySyntax page on the Apache wiki, the [* TO *] syntax just selects a range (in this case all values for a field)
So it seems to be:
- type:0 is for bitstreams according to the DSpace Solr documentation
- -(bundleName:[*+TO+*]-bundleName:ORIGINAL) seems to be a negative query starting with all documents, subtracting those with bundleName:ORIGINAL, and then negating the whole thing... meaning only documents from bundleName:ORIGINAL?
What the shit, I think I'm right: the simplified logic in this query returns the same 889:

$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'

And if I simplify the statistics_type logic the same way, it still returns the same 889!

$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'

As for item views, I suppose that's just the same query, minus the bundleName:ORIGINAL:

$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'

That one returns 766, which is exactly 1655 minus 889...
Also, Solr's fq is similar to the regular q query parameter, but each fq result is cached separately in the Solr filterCache, so it should be faster when the same filter is re-used across multiple queries
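The simplified queries above differ only in the bundleName filter, so they are easy to generate. This is a sketch that just builds the URLs, assuming the same local Solr endpoint as in my examples; `stats_url` is a hypothetical helper, not something from DSpace or Solr.

```python
from urllib.parse import urlencode

# Assumed local Solr endpoint, copied from the queries above
SOLR_SELECT = "http://localhost:3000/solr/statistics/select"


def stats_url(owning_item, downloads):
    """Build a count-only query for an item's downloads or its other view events."""
    bundle_filter = "bundleName:ORIGINAL" if downloads else "-bundleName:ORIGINAL"
    params = [
        ("indent", "on"),
        ("rows", "0"),  # only the numFound count is needed
        ("q", "type:0 owningItem:{}".format(owning_item)),
        ("fq", "isBot:false"),
        ("fq", bundle_filter),
        ("fq", "statistics_type:view"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)
```

`urlencode` percent-encodes the colons and turns the space in `q` into `+`, which Solr decodes back to the same query.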

2018-09-18

I managed to create a simple proof of concept REST API to expose item view and download statistics: cgspace-statistics-api
It uses the Python-based Falcon web framework and talks to Solr directly using the SolrClient library (which seems to have issues in Python 3.7 currently)
After deploying on DSpace Test I can then get the stats for an item using its ID:

$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
{
    "downloads": 2,
    "id": 110988,
    "views": 15
}

The numbers are different than those that come from Atmire's statlets for some reason, but as I'm querying Solr directly, I have no idea where their numbers come from!
Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&page=1
Getting all the item IDs from PostgreSQL is certainly easy:

dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;

The rest of the Falcon tooling will be more difficult...
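The paging itself is mostly arithmetic on LIMIT and OFFSET. A minimal sketch, using an in-memory SQLite table as a stand-in for the real database and assuming zero-based page numbers (Moayad's example does not say which page is first); `fetch_item_ids` is a hypothetical name:

```python
import math
import sqlite3


def fetch_item_ids(conn, limit, page):
    """Return (item_ids, total_pages) for a zero-based page number."""
    total = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
    total_pages = math.ceil(total / limit)
    rows = conn.execute(
        "SELECT item_id FROM item ORDER BY item_id LIMIT ? OFFSET ?",
        (limit, limit * page),
    ).fetchall()
    return [row[0] for row in rows], total_pages


# Stand-in data: 250 fake item IDs in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item(item_id INT PRIMARY KEY)")
conn.executemany("INSERT INTO item VALUES (?)", [(i,) for i in range(250)])
ids, pages = fetch_item_ids(conn, 100, 2)
print(len(ids), pages)
```

The ORDER BY matters: without a stable ordering, consecutive pages could overlap or skip rows.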

2018-09-19

I emailed Jane Poole to ask if there is some money we can use from the Big Data Platform (BDP) to fund the purchase of some Atmire credits for CGSpace
I learned that there is an efficient way to do "deep paging" in large Solr result sets by using cursorMark, but it doesn't work with faceting

2018-09-20

Contact Atmire to ask how we can buy more credits for future development (#644)
I researched the Solr filterCache size and found that the formula for calculating the potential memory use of the cache (not of a single entry) is:

((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)

Which means that, for our statistics core with 149 million documents, a completely full filterCache of 512 entries could use 8.9 GB!

((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)

So I think we can forget about tuning this for now!
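The arithmetic is easy to re-run for other core sizes. A small sketch of the formula quoted above, with a hypothetical function name; note that the multiplication by the configured size makes the result a worst case for the whole cache, with each entry being one bitset of maxDoc bits plus some overhead.

```python
def filter_cache_bytes(max_doc, cache_size):
    """Worst-case filterCache memory: (maxDoc/8 + 128) bytes per entry, times the size."""
    bytes_per_entry = max_doc // 8 + 128
    return bytes_per_entry * cache_size


total = filter_cache_bytes(149374568, 512)
print(total)                        # 9560037888
print(round(total / 1024 ** 3, 1))  # 8.9
```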
Discussion on the mailing list about filterCache size
Article discussing testing methodology for different filterCache sizes
Discuss Handle links on Twitter with IWMI

2018-09-21

I see that there was a nice optimization to the ImageMagick PDF CMYK detection in the upstream dspace-5_x branch: DS-3664
The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I'll cherry-pick that fix into our 5_x-prod branch:
- 4e8c7b578bdbe26ead07e36055de6896bbf02f83: ImageMagick: Only execute "identify" on first page
I think it would also be nice to cherry-pick the fixes for DS-3883, which is related to optimizing the XMLUI item display of items with many bitstreams
- a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don't loop through original bitstreams if only displaying thumbnails
- 8d81e825dee62c2aa9d403a505e4a4d798964e8d: DS-3883: If only including thumbnails, only load the main item thumbnail.

2018-09-23

I did more work on my cgspace-statistics-api, fixing some item view counts and adding indexing via SQLite (I'm trying to avoid having to set up yet another database, user, password, etc) during deployment
I created a new branch called 5_x-upstream-cherry-picks to test and track those cherry-picks from the upstream 5.x branch
Also, I need to test the new LDAP server, so I will deploy that on DSpace Test today
Rename my cgspace-statistics-api to dspace-statistics-api on GitHub

2018-09-24

Trying to figure out how to get item views and downloads from SQLite in a join
It appears SQLite doesn't support FULL OUTER JOIN so some people on StackOverflow have emulated it with LEFT JOIN and UNION:

> SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
LEFT JOIN itemdownloads downloads USING(id)
UNION ALL
SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
LEFT JOIN itemviews views USING(id)
WHERE views.id IS NULL;

This "works" but the resulting rows are kinda messy so I'd have to do extra logic in Python
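The messiness could probably be pushed back into SQL instead of Python. This is a sketch of one way to do it, assuming the two tables only have the `id`, `views` and `downloads` columns used above: each half of the UNION emits the same three columns and fills in 0 for the missing side.

```python
import sqlite3

MERGED_STATS_SQL = """
SELECT v.id AS id, v.views AS views, COALESCE(d.downloads, 0) AS downloads
FROM itemviews v
LEFT JOIN itemdownloads d ON d.id = v.id
UNION ALL
SELECT d.id, 0, d.downloads
FROM itemdownloads d
LEFT JOIN itemviews v ON v.id = d.id
WHERE v.id IS NULL
ORDER BY id
"""


def merged_stats(conn):
    """Return one (id, views, downloads) tuple per item seen in either table."""
    return conn.execute(MERGED_STATS_SQL).fetchall()
```

The first SELECT covers every item that has views, the second adds items that only have downloads, so no ID should appear twice.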
Maybe we can use one "items" table with default values and UPSERT (aka insert... on conflict ... do update):

sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
sqlite> INSERT INTO items(id, views) VALUES(0, 52);
sqlite> INSERT INTO items(id, downloads) VALUES(1, 171);
sqlite> INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UPDATE SET downloads=176;
sqlite> INSERT INTO items(id, views) VALUES(0, 78) ON CONFLICT(id) DO UPDATE SET views=78;
sqlite> INSERT INTO items(id, views) VALUES(0, 3) ON CONFLICT(id) DO UPDATE SET downloads=3;
sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET downloads=excluded.views;

This totally works!
Note the special excluded.views form! See SQLite's lang_UPSERT documentation
Oh nice, I finally finished the Falcon API route to page through all the results using SQLite's amazing LIMIT and OFFSET support
But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu's SQLite is old and doesn't support UPSERT, so my indexing doesn't work...
Apparently UPSERT came in SQLite 3.24.0 (2018-06-04), and Ubuntu 16.04 has 3.11.0
Ok this is hilarious, I manually downloaded the libsqlite3 3.24.0 deb from Ubuntu 18.10 "cosmic" and installed it in Ubuntu 16.04 and now the Python indexer.py works
This is definitely a dirty hack, but the list of packages we use that depend on libsqlite3-0 in Ubuntu 16.04 is actually pretty short:

# apt-cache rdepends --installed libsqlite3-0 | sort | uniq
  gnupg2
  libkrb5-26-heimdal
  libnss3
  libpython2.7-stdlib
  libpython3.5-stdlib

I wonder if I could work around this by detecting the SQLite library version, for example on Ubuntu 16.04 after I replaced the library:

# python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> print(sqlite3.sqlite_version)
3.24.0
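A version check like that could choose between UPSERT and a two-statement fallback at runtime. This is only a sketch of the idea, not what indexer.py does; `set_views` is a hypothetical helper and the table is the `items` table from the UPSERT experiment above.

```python
import sqlite3

# UPSERT (ON CONFLICT ... DO UPDATE) needs SQLite 3.24.0 or newer
HAS_UPSERT = sqlite3.sqlite_version_info >= (3, 24, 0)


def set_views(conn, item_id, views):
    """Insert or update the view count for one item."""
    if HAS_UPSERT:
        conn.execute(
            "INSERT INTO items(id, views) VALUES(?, ?) "
            "ON CONFLICT(id) DO UPDATE SET views=excluded.views",
            (item_id, views),
        )
    else:
        # Fallback for older libraries: create the row if missing, then update it
        conn.execute("INSERT OR IGNORE INTO items(id) VALUES(?)", (item_id,))
        conn.execute("UPDATE items SET views=? WHERE id=?", (views, item_id))
```

Both branches leave the other column alone, so an item's downloads survive an update to its views.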

Or maybe I should just bite the bullet and migrate this to PostgreSQL, as it supports UPSERT since version 9.5 and also seems to have my new favorite LIMIT and OFFSET
I changed the syntax of the SQLite stuff and PostgreSQL is working flawlessly with psycopg2... hmmm.
For reference, creating a PostgreSQL database for testing this locally (though indexer.py will create the table):

$ createuser -h localhost -U postgres --pwprompt dspacestatistics
$ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
$ psql -h localhost -U postgres dspacestatistics
dspacestatistics=> CREATE TABLE IF NOT EXISTS items
dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);

2018-09-25

I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views
I'm not even sure how that's possible, as we only have 74,000 items!
I need to inspect the id values that are returned for views and cross check them with the owningItem values for bitstream downloads...
Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr id field doesn't correspond with actual DSpace items?)
I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don't give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
CGSpace's Solr core has 150,000,000 documents in it... and it's still pretty fast to query, but it's really a maintenance and backup burden
DSpace Test currently has about 2,000,000 documents with isBot:true in its Solr statistics core, and the size on disk is 2GB (it's not much, but I have to test this somewhere!)
According to the DSpace 5.x Solr documentation I can use dspace stats-util -f, so let's try it:

$ dspace stats-util -f

The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with isBot:true
I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it's 201 instead of 2,000,000, and statistics core is only 30MB now!
I will set the logBots = false property in dspace/config/modules/usage-statistics.cfg on DSpace Test and check if the number of isBot:true events goes up any more...
I restarted the server with logBots = false and after it came back up I see 266 events with isBot:true (maybe they were buffered)... I will check again tomorrow
After a few hours I see there are still only 266 view events with isBot:true on DSpace Test's Solr statistics core, so I'm definitely going to deploy this on CGSpace soon
Also, CGSpace currently has 60,089,394 view events with isBot:true in its Solr statistics core and it is 124GB!
Amazing! After running dspace stats-util -f on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with isBot:true so I should really disable logging of bot events!
I'm super curious to see how the JVM heap usage changes...
I made (and merged) a pull request to disable bot logging on the 5_x-prod branch (#387)
Now I'm wondering if there are other bot requests that aren't classified as bots because the IP lists or user agents are outdated
DSpace ships a list of spider IPs, for example: config/spiders/iplists.com-google.txt
I checked the list against all the IPs we've seen using the "Googlebot" useragent on CGSpace's nginx access logs
The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be "Googlebot"...
According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either googlebot.com or google.com
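The domain check itself is simple once the reverse DNS name is known. This is a minimal sketch of just the string comparison, with a hypothetical function name and sample hostnames; Google's guidance also recommends confirming the name with a forward lookup, which this does not do.

```python
def is_google_hostname(hostname):
    """Return True if a reverse DNS name belongs to googlebot.com or google.com."""
    # Solr's dns field keeps the trailing dot, so strip it before comparing
    name = hostname.strip().rstrip(".").lower()
    return any(
        name == domain or name.endswith("." + domain)
        for domain in ("googlebot.com", "google.com")
    )
```

Matching on a leading dot avoids accepting look-alike names that merely end in the same letters, which a bare wildcard suffix match would let through.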
In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):

*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false

I translate that into a delete command using the /update handler:

http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>

And magically all those 81,000 documents are gone!
After a few hours the Solr statistics core is down to 44GB on CGSpace!
I did a major refactor and logic fix in the DSpace Statistics API's indexer.py
Basically, it turns out that using facet.mincount=1 is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways
I deployed the new version on CGSpace and now it looks pretty good!

Indexing item views (page 28 of 753)
...
Indexing item downloads (page 260 of 260)

And now it's fast as hell due to the muuuuch smaller Solr statistics core

2018-09-26

Linode emailed to say that CGSpace (linode18) was using 30Mb/sec of outward bandwidth for two hours around midnight
I don't see anything unusual in the nginx logs, so perhaps it was the cron job that syncs the Solr database to Amazon S3?
It could be that the bot purge yesterday changed the core significantly so there was a lot to change?
I don't see any drop in JVM heap size in CGSpace's munin stats since I did the Solr cleanup, but this looks pretty good:

I will have to keep an eye on that over the next few weeks to see if things stay as they are
I did a batch replacement of the access rights with my fix-metadata-values.py script on DSpace Test:

$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206

This changes "Open Access" to "Unrestricted Access" and "Limited Access" to "Restricted Access"
After that I did a full Discovery reindex:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    77m3.755s
user    7m39.785s
sys     2m18.485s

I told Peter it's better to do the access rights before the usage rights because the git branches are conflicting with each other and it's actually a pain in the ass to keep changing the values as we discuss, rebase, merge, fix conflicts...
Udana and Mia from WLE were asking some questions about their WLE Feedburner feed
It's pretty confusing, because until recently they were entering issue dates as only YYYY (like 2018) and their feeds were all showing items in the wrong order
I'm not exactly sure what their problem is now, though (confusing)
I updated the dspace-statistics-api to use psycopg2's execute_values() to insert batches of 100 values into PostgreSQL instead of doing every insert individually
On CGSpace this reduces the total run time of indexer.py from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)
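A rough sketch of what batching with `psycopg2.extras.execute_values()` looks like. The table name, column names, and upsert clause are hypothetical and only stand in for whatever the real indexer.py uses; `statements_needed()` is a helper added here purely to show how many statements get sent.

```python
from typing import Iterable, Tuple


def insert_item_views(cursor, rows: Iterable[Tuple[int, int]], batch_size: int = 100) -> None:
    """Upsert (item id, views) pairs using multi-row INSERT statements."""
    # Imported lazily so this module still loads where psycopg2 is not installed.
    from psycopg2.extras import execute_values

    execute_values(
        cursor,
        # Hypothetical table and columns; the single %s is expanded by
        # execute_values into up to batch_size value tuples per statement.
        "INSERT INTO items (id, views) VALUES %s "
        "ON CONFLICT (id) DO UPDATE SET views = excluded.views",
        list(rows),
        page_size=batch_size,
    )


def statements_needed(row_count: int, batch_size: int = 100) -> int:
    """Number of INSERT statements sent for row_count rows."""
    return -(-row_count // batch_size)  # ceiling division
```

Batching only removes per-statement round trips to PostgreSQL, which fits the modest gain noted above when most of the run time is spent fetching from Solr.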

2018-09-27

Linode emailed to say that CGSpace's (linode18) CPU load was high for a few hours last night
Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    295 34.218.226.147
    296 66.249.64.95
    350 157.55.39.185
    359 207.46.13.28
    371 157.55.39.85
    388 40.77.167.148
    444 66.249.64.93
    544 68.6.87.12
    834 66.249.64.91
    902 35.237.175.180
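The same tally can be sketched in Python, assuming the logs are in nginx's combined format with the client IP as the first field and the timestamp in square brackets. The function name and defaults are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "[26/Sep/2018:19:05:01 +0200]" and captures the hour.
TIMESTAMP = re.compile(r"\[26/Sep/2018:(\d{2}):")


def top_clients(lines, hours=("19", "20", "21"), limit=10):
    """Return the busiest client IPs within the given hours, most active first."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match and match.group(1) in hours:
            counts[line.split()[0]] += 1  # first field is the client IP
    return counts.most_common(limit)
```

Unlike the `sort -n | tail` pipeline, `most_common()` returns the list in descending order.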

35.237.175.180 is on Google Cloud
68.6.87.12 is on Cox Communications in the US (?)
These hosts are not using proper user agents and are not re-using their Tomcat sessions:

$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
5423
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
758
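Note that `grep -c` prints the number of matching lines, so the `sort | uniq` after it only ever sees that one number. If the goal is the number of distinct Tomcat sessions per IP, something like the following sketch would do it, assuming the `session_id=...:ip_addr=...` layout seen in the grep pattern above. The function name is made up for illustration.

```python
import re


def count_sessions(lines, ip):
    """Count distinct session IDs logged for one client IP."""
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip) + r"(?!\d)"
    )
    sessions = set()
    for line in lines:
        # findall returns only the captured session ID for each match
        sessions.update(pattern.findall(line))
    return len(sessions)
```

The `re.escape()` call keeps the dots in the address literal, and the `(?!\d)` lookahead stops `68.6.87.12` from also matching an address like `68.6.87.123`.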

I will add their IPs to the list of bad bots in nginx so we can add a "bot" user agent to them and let Tomcat's Crawler Session Manager Valve handle them
I asked Atmire to prepare an invoice for 125 credits

2018-09-29

I merged some changes to author affiliations from Sisay as well as some corrections to organizational names using smart quotes like Université d’Abomey Calavi (#388)
Peter sent me a list of 43 author names to fix, but it had some encoding errors like BelalcÃ¡zar, John like usual (I will tell him to stop trying to export as UTF-8 because it never seems to work)
I did batch replaces for both on CGSpace with my fix-metadata-values.py script:

$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3

Afterwards I started a full Discovery re-index:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
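As an aside on the encoding errors in Peter's author list: names like BelalcÃ¡zar are what UTF-8 bytes look like when they are read as Latin-1 somewhere along the way. A minimal sketch of reversing that, assuming exactly one round of mis-decoding (the function name is made up, and this is not what fix-metadata-values.py does):

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 text having been decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already clean), so leave it alone.
        return text
```

For example, `repair_mojibake("BelalcÃ¡zar, John")` gives back `"Belalcázar, John"`, while a clean value such as `"Université d’Abomey Calavi"` passes through unchanged because the smart quote cannot be encoded as Latin-1.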

Linode sent an alert that both CGSpace and DSpace Test were using high CPU for the last two hours
It seems to be Moayad trying to do the AReS explorer indexing
He was sending too many (5 or 10) concurrent requests to the server, but still... why is this shit so slow?!

2018-09-30

Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc
I think I should just batch export and update all languages...

dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;

Then I can simply delete the "Other" and "other" ones because that's not useful at all:

dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
DELETE 6
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
DELETE 79

Looking through the list I see some weird language codes like gh, so I checked out those items:

dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
 resource_id
-------------
       94530
       94529
dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2 AND handle.resource_id = item.item_id AND handle.resource_id in (94530, 94529);
   handle    | item_id
-------------+---------
 10568/91386 |   94529
 10568/91387 |   94530

Those items are from Ghana, so the submitter apparently thought gh was a language... I can safely delete them:

dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
DELETE 2

The next issue would be jn:

dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
 resource_id
-------------
       94001
       94003
dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2 AND handle.resource_id = item.item_id AND handle.resource_id in (94001, 94003);
   handle    | item_id
-------------+---------
 10568/90868 |   94001
 10568/90870 |   94003

Those items are about Japan, so I will update them to be ja
Other replacements:

DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';

Then there are 12 items with en|hi, but they were all in one collection so I just exported it as a CSV and then re-imported the corrected metadata
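The deletions and replacements above boil down to a small lookup table. A sketch of the same rules in Python, for example to check a CSV export before re-importing it; the function name is made up, while the codes are the ones from the SQL above:

```python
# Bad value -> corrected ISO 639-1 code, as in the UPDATE statements above.
REPLACEMENTS = {"fn": "fr", "in": "hi", "Ja": "ja", "jn": "ja", "jp": "ja"}

# Values that were deleted outright rather than corrected.
DELETIONS = {"Other", "other", "gh"}


def normalize_language(value):
    """Return the corrected code, or None if the value should be removed."""
    if value in DELETIONS:
        return None
    return REPLACEMENTS.get(value, value)
```

Anything not listed, such as `en`, is returned as-is.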
