mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-07-03
@@ -284,5 +284,7 @@ $ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-resul

- Then I took all the terms with fifty or more occurrences and put them on a Google Sheet
- There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
- pnbecker on DSpace Slack mentioned that they made a JSPUI deduplication step that is open source: https://github.com/the-library-code/deduplication
- It uses Levenshtein distance via PostgreSQL's fuzzystrmatch extension

<!-- vim: set sw=2 ts=2: -->

35
content/posts/2022-07.md
Normal file
@@ -0,0 +1,35 @@
---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-07-02

- I learned how to use the Levenshtein functions in PostgreSQL
- These functions accept strings of at most 255 characters, so you need to truncate longer strings before comparing
- Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to lowercase both strings first

<!--more-->

- A working query checking for duplicates in the recent AfricaRice items is:

```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
text_value
────────────────────────────────────────────────────────────────────────────────────────
International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
```

- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method

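
- The normalization in the query above (lowercase first, then truncate to 255 characters) can also be sketched outside the database
- This is only a minimal Python illustration, not the fuzzystrmatch implementation: `levenshtein`, `normalize`, and `is_probable_duplicate` are hypothetical helper names, and Python's `str.lower()` may not agree with PostgreSQL's `LOWER()` in every locale

```python
MAX_LEN = 255  # mirrors the 255-character limit mentioned in the notes above


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance; every edit costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


def normalize(title: str) -> str:
    """Lowercase, then truncate, like LEFT(LOWER(text_value), 255)."""
    return title.lower()[:MAX_LEN]


def is_probable_duplicate(candidate: str, existing: str, max_distance: int = 3) -> bool:
    """True if the normalized titles are within max_distance edits (assumed threshold)."""
    return levenshtein(normalize(candidate), normalize(existing)) <= max_distance


if __name__ == "__main__":
    new_title = ("International Trade and Exotic Pests: "
                 "The Risks for Biodiversity and African Economies")
    old_title = ("International trade and exotic pests: "
                 "the risks for biodiversity and African economies")
    print(levenshtein(new_title, old_title))            # raw, case-sensitive distance
    print(is_probable_duplicate(new_title, old_title))  # compared after normalizing
```

- Without lowercasing, the two titles differ in every capitalized word, so a threshold of 3 would miss the match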
## 2022-07-03

- Start a harvest on AReS

<!-- vim: set sw=2 ts=2: -->