Turning Elasticsearch, Python, and Flask into a web app (?)

Having just learned a little Elasticsearch, I came up with a slightly forced way to put it to use.

f:id:konchangakita:20200715235858p:plain

Eventually, the idea is to run input images through an ML model and hook the analyzed and classified images into this.

= Development environment ====
Jupyter Notebook
Visual Studio Code
Docker
Kubernetes

Python 3.8
Flask 1.1.2
Elasticsearch 7.8
========

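Python and Elasticsearch

First, let's learn the basic conventions for working with Elasticsearch from Python.

Installing the Elasticsearch module for Python

In an environment where Anaconda is installed, run the following with administrator privileges: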

$ conda install elasticsearch

Check it in Jupyter Notebook.
f:id:konchangakita:20200716004545p:plain

The code below specifies the Elasticsearch server and lists its indices.

from elasticsearch import Elasticsearch
es = Elasticsearch('localhost:9200')
res = es.cat.indices()

Nothing special so far.
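
The cat API returns plain text, so a small amount of parsing is needed to get a Python list of index names. The following is a minimal sketch, assuming the 7.x Python client where `cat.indices` accepts the `h` parameter to select columns; `parse_index_names` is a hypothetical helper, and the sample text is made up for illustration.

```python
def parse_index_names(cat_text):
    """Turn the one-column text from es.cat.indices(h='index') into a list of names."""
    return [line.strip() for line in cat_text.splitlines() if line.strip()]


# Made-up sample of one-column cat output (one index name per line).
sample = "test_image\n.kibana_1\n"
names = parse_index_names(sample)
print(names)  # ['test_image', '.kibana_1']

# With a live server (not executed here):
#   names = parse_index_names(es.cat.indices(h='index'))
```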

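Putting image information into Elasticsearch

The images are borrowed from the FF XIV fan kit.
https://jp.finalfantasyxiv.com/lodestone/special/fankit/desktop_wallpaper/4_0/

This is the information stored in Elasticsearch for each image:
　・Registration time (current time)
　・File name
　・Category
　・ID

First, get the file names for each category.
The glob method of pathlib is used to get the file names in a folder.
(This is a Windows environment.)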

import pathlib

# Raw string so that the backslashes in the Windows path are taken literally
img_path = pathlib.Path(r'D:\data\FF14\image\shadowbringers')
img_path.glob('*.jpg')

glob returns an iterator (a generator), so the entries are pulled out with a for loop.

[f.name for f in img_path.glob('*.jpg')]

f:id:konchangakita:20200716011523p:plain
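
Since glob yields entries in no guaranteed order, sorting the names makes the listing reproducible. The following is a minimal sketch of that idea; `list_image_names` is a hypothetical helper, and a temporary folder with made-up file names stands in for the real image folder.

```python
import pathlib
import tempfile


def list_image_names(folder, pattern='*.jpg'):
    """Return the sorted file names in `folder` that match `pattern`."""
    return sorted(p.name for p in pathlib.Path(folder).glob(pattern))


# A temporary folder stands in for the real image folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('b.jpg', 'a.jpg', 'notes.txt'):
        (pathlib.Path(tmp) / name).touch()
    names = list_image_names(tmp)

print(names)  # ['a.jpg', 'b.jpg']
```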

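These image files get "category" set to "shadowbringers" and are put into Elasticsearch together with a timestamp.

Inserting the data

The image information collected with glob is inserted as multiple documents in one go.
　・Registration time (current time): datetime.utcnow()
　・File name: taken from the for loop
　・Category: "shadowbringers"
　・ID: the file name without its extension

To insert multiple documents at once, "helpers.bulk" seems to be the way to go.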

from elasticsearch import helpers
from datetime import datetime

index_name = 'test_image'
category = 'shadowbringers'

# Get the list of images in the folder and send it to Elasticsearch
img_path = pathlib.Path(r'D:\data\FF14\image\shadowbringers')
actions = []
for f in img_path.glob('*.jpg'):
    # ID: the file name without its extension
    f_id = f.stem

    doc = {'image_name': f.name, 'timestamp': datetime.utcnow(), 'category': category}

    actions.append({'_index': index_name, '_type': '_doc', '_id': f_id, '_source': doc})

helpers.bulk(es, actions)

f:id:konchangakita:20200716225155p:plain
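
helpers.bulk takes any iterable of actions, so the actions can also come from a generator, which keeps building them separate from sending them. The following is only a sketch under a few assumptions: `generate_actions` is a hypothetical helper, the file names are made up, one timestamp is shared by the whole batch, and `_type` is left out on the assumption that Elasticsearch 7.x falls back to `_doc`.

```python
import pathlib
from datetime import datetime


def generate_actions(paths, index_name, category, timestamp):
    """Yield one bulk action per image path."""
    for p in paths:
        p = pathlib.PurePath(p)
        yield {
            '_index': index_name,
            '_id': p.stem,  # file name without its last extension
            '_source': {
                'image_name': p.name,
                'timestamp': timestamp,
                'category': category,
            },
        }


# Made-up file names and a fixed timestamp, for illustration only.
registered_at = datetime(2020, 7, 16, 13, 57, 23)
actions = list(generate_actions(['abc.jpg', 'def.jpg'],
                                'test_image', 'shadowbringers', registered_at))
print([a['_id'] for a in actions])  # ['abc', 'def']

# With a live server this could be passed straight to helpers.bulk (not executed here):
#   helpers.bulk(es, generate_actions(img_path.glob('*.jpg'), index_name,
#                                     category, datetime.utcnow()))
```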

Checking the inserted data

index_name = 'test_image'
category = 'shadowbringers'

body =  { 
    "query": {
        "function_score" : {
            "query": {"match": { "category" : category }},
            "random_score": {}
        }
    }
}

es.search(index=index_name, body=body, size=5)

===
{'took': 163,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 21, 'relation': 'eq'},
  'max_score': 3.0576708,
  'hits': [{'_index': 'test_image',
    '_type': '_doc',
    '_id': 'q6EdL7-scpFFxtEdvUrTb09-Pc',
    '_score': 3.0576708,
    '_source': {'image_name': 'q6EdL7-scpFFxtEdvUrTb09-Pc.jpg',
     'timestamp': '2020-07-16T13:57:23.193518',
     'category': 'shadowbringers'}},

(omitted)

   {'_index': 'test_image',
    '_type': '_doc',
    '_id': 'mVaNuPNSylbeAyfXD3-4vBDh5U',
    '_score': 2.4496062,
    '_source': {'image_name': 'mVaNuPNSylbeAyfXD3-4vBDh5U.jpg',
     'timestamp': '2020-07-16T13:57:23.193518',
     'category': 'shadowbringers'}}]}}
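
As far as I understand it, with the default settings of function_score the empty random_score multiplies each hit's match score by a random value, so the same search returns a different set of images each time. The following sketch wraps the query and pulls the file names out of the response; `random_category_query` and `image_names` are hypothetical helpers, and the sample response is trimmed down from the output above.

```python
def random_category_query(category):
    """Build a search body that returns hits of `category` in random order."""
    return {
        'query': {
            'function_score': {
                'query': {'match': {'category': category}},
                'random_score': {},
            }
        }
    }


def image_names(response):
    """Extract the image_name of every hit from a search response."""
    return [hit['_source']['image_name'] for hit in response['hits']['hits']]


# Response trimmed down from the output shown above.
sample_response = {
    'hits': {
        'hits': [
            {'_id': 'q6EdL7-scpFFxtEdvUrTb09-Pc',
             '_source': {'image_name': 'q6EdL7-scpFFxtEdvUrTb09-Pc.jpg'}},
            {'_id': 'mVaNuPNSylbeAyfXD3-4vBDh5U',
             '_source': {'image_name': 'mVaNuPNSylbeAyfXD3-4vBDh5U.jpg'}},
        ]
    }
}

body = random_category_query('shadowbringers')
names = image_names(sample_response)
print(names)

# With a live server (not executed here):
#   names = image_names(es.search(index='test_image', body=body, size=5))
```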

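A few more categories are registered in the same way.

Aggregation in Elasticsearch

An aggregation apparently groups and summarizes the results of a search query.
Here it is used to fetch the distinct category values.

For example, given data like this, the goal is to get the values "A", "B", and "C":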

id	category	file_name
0	A	000.jpg
1	A	111.jpg
2	C	222.jpg
3	B	333.jpg
4	C	444.jpg

That is the idea.
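
To make the goal concrete, here is a plain-Python analogue of what the aggregation is expected to return for the toy table above. This is not how Elasticsearch computes it; it only mimics the shape of the result (one bucket per category with a document count). Sorting by count, then by key, is my assumption about the default bucket order.

```python
from collections import Counter

# The toy table from above: (id, category, file_name)
rows = [
    (0, 'A', '000.jpg'),
    (1, 'A', '111.jpg'),
    (2, 'C', '222.jpg'),
    (3, 'B', '333.jpg'),
    (4, 'C', '444.jpg'),
]

# Count the documents per category, like a terms aggregation on "category".
counts = Counter(category for _, category, _ in rows)

# Largest buckets first; ties are broken by key.
buckets = [{'key': k, 'doc_count': n}
           for k, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

categories = [b['key'] for b in buckets]
print(categories)  # ['A', 'C', 'B']
```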

First, as preparation for running the aggregation on this field, set "fielddata": true on it
(fielddata is disabled by default on text fields).
The mapping before the change:

GET /test_image/_mapping

===
{
  "test_image" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
(omitted)
    }
  }
}

Setting fielddata

PUT /test_image/_mapping
{
  "properties": {
    "category": { 
      "type":     "text",
      "fielddata": true
    }
  }
}
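
The same request could probably be sent from Python instead of the console. This is a sketch, assuming the 7.x Python client and its `indices.put_mapping(index=..., body=...)` call; `fielddata_mapping` is a hypothetical helper that only builds the request body.

```python
def fielddata_mapping(field):
    """Build a mapping body that enables fielddata on a text field."""
    return {'properties': {field: {'type': 'text', 'fielddata': True}}}


body = fielddata_mapping('category')
print(body)

# With a live server (not executed here):
#   es.indices.put_mapping(index='test_image', body=body)
```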

Check the result:

GET /test_image/_mapping

===
{
  "test_image" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true
        }
(omitted)
    }
  }
}

With this in place, the category values can be collected.

body =  {
    "aggs" : {
        "by_category" : { "terms": { "field" : "category" } }
    },
    "size" : 0
}

es.search(index=index_name, body=body, size=0)

===
{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 460, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'by_category': {'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0,
   'buckets': [{'key': 'minion', 'doc_count': 279},
    {'key': 'mount', 'doc_count': 113},
    {'key': 'stormblood', 'doc_count': 33},
    {'key': 'shadowbringers', 'doc_count': 21},
    {'key': 'job', 'doc_count': 12}]}}}

res = es.search(index=index_name, body=body, size=0)
[k['key'] for k in res['aggregations']['by_category']['buckets']]

===
['minion', 'mount', 'stormblood', 'shadowbringers', 'job']
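
As an alternative to enabling fielddata, the mapping shown earlier already has a "category.keyword" subfield, and terms aggregations on keyword fields work without fielddata. One difference to keep in mind: keyword buckets are the exact stored strings, while a text field with fielddata is bucketed by its analyzed tokens. The sketch below assumes that subfield exists in the index; `category_terms_body` and `bucket_counts` are hypothetical helpers, and the sample response is trimmed down from the output above.

```python
def category_terms_body(field='category.keyword', max_buckets=10):
    """Build a terms-aggregation body that returns buckets only (no hits)."""
    return {
        'size': 0,
        'aggs': {'by_category': {'terms': {'field': field, 'size': max_buckets}}},
    }


def bucket_counts(response, agg_name='by_category'):
    """Map each bucket key to its document count."""
    buckets = response['aggregations'][agg_name]['buckets']
    return {b['key']: b['doc_count'] for b in buckets}


# Response trimmed down from the output shown above.
sample_response = {
    'aggregations': {'by_category': {'buckets': [
        {'key': 'minion', 'doc_count': 279},
        {'key': 'mount', 'doc_count': 113},
        {'key': 'stormblood', 'doc_count': 33},
        {'key': 'shadowbringers', 'doc_count': 21},
        {'key': 'job', 'doc_count': 12},
    ]}}
}

body = category_terms_body()
counts = bucket_counts(sample_response)
print(counts['shadowbringers'])  # 21

# With a live server (not executed here):
#   counts = bucket_counts(es.search(index='test_image', body=body))
```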