Elasticsearch: The Definitive Guide reading notes (1): Getting Started

Date: Oct. 24, 2018 Category: Reading notes

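Getting Started

Excerpted from Elasticsearch: The Definitive Guide

Elasticsearch is a real-time distributed search and analytics engine.

What it is

Elasticsearch is an open-source search engine built on Apache Lucene(TM). Lucene is arguably the most advanced, highest-performing, and most fully featured search engine library available today, open source or proprietary.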

Installation

Download page

Download and start the service

$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.2.tar.gz
$ tar xf elasticsearch-6.4.2.tar.gz 
$ cd elasticsearch-6.4.2
$ ./bin/elasticsearch &

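To run in the background, pass the -d flag (it daemonizes the process, so the trailing & is not needed). To make the node reachable from other hosts, change network.host in elasticsearch.yml.

Marvel is the management and monitoring tool for Elasticsearch.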

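For older versions (these notes say 2.3, although the -i form below looks like the 1.x plugin syntax), it can be installed with: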

$ ./bin/plugin -i elasticsearch/marvel/latest

From 5.0 on, these plugins are bundled into X-Pack:

$ ./bin/elasticsearch-plugin install x-pack
ERROR: this distribution of Elasticsearch contains X-Pack by default

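So the default 6.4.2 distribution already includes X-Pack; there is nothing to install.

Check that the service is running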

$ curl 'http://localhost:9200/?pretty'
{
  "name" : "KAFEEVp",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "7Yq_taMzTDSdy2AX8h-k7g",
  "version" : {
    "number" : "6.4.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "04711c2",
    "build_date" : "2018-09-26T13:34:09.098244Z",
    "build_snapshot" : false,
    "lucene_version" : "7.4.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Problems you may run into

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Fix: raise vm.max_map_count to at least 262144 in /etc/sysctl.conf, then apply it with sysctl -p.

system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk

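CentOS 6 does not support seccomp, but bootstrap.system_call_filter defaults to true, so the bootstrap check fails. Set bootstrap.memory_lock and bootstrap.system_call_filter to false in elasticsearch.yml.

API interaction

Elasticsearch nodes talk to each other over port 9300 using the Elasticsearch transport protocol; the Java client also connects over port 9300.

Clients in other languages use the RESTful HTTP API over port 9200.

General form of a curl request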

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
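The same pieces map onto any HTTP client. A minimal sketch using only Python's standard library; build_request is a hypothetical helper name, and the request is only assembled here, not sent:

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_request(verb, protocol, host, port, path, query=None, body=None):
    """Assemble <VERB> <PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING> with an optional JSON body."""
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        url += "?" + urlencode(query)
    # Elasticsearch 6.x requires a Content-Type header whenever a body is sent.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=verb)


req = build_request("GET", "http", "localhost", 9200, "/", query={"pretty": "true"})
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would actually send it, which needs a running node.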

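Documents

Objects in an application usually have a complex structure, containing things like dates, geolocations, and so on.

Elasticsearch is document-oriented: it stores whole objects (documents) and indexes the contents of each one, so data that does not fit neatly into rows and columns can still be indexed, searched, sorted, and filtered.

Elasticsearch uses JSON as the serialization format for documents. JSON has more or less become the standard format in the NoSQL world: concise, simple, and easy to read.

A JSON description of an object, i.e. a document: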

{
    "email":      "john@smith.com",
    "first_name": "John",
    "last_name":  "Smith",
    "info": {
        "bio":         "Eco-warrior and defender of the weak",
        "age":         25,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2014/05/01"
}
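To show how an application object turns into such a document, here is a small sketch. The User and Info classes are made up for illustration; only the field names and values come from the JSON above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Info:
    bio: str
    age: int
    interests: List[str] = field(default_factory=list)


@dataclass
class User:
    email: str
    first_name: str
    last_name: str
    info: Info
    join_date: str


user = User(
    email="john@smith.com",
    first_name="John",
    last_name="Smith",
    info=Info(bio="Eco-warrior and defender of the weak", age=25,
              interests=["dolphins", "whales"]),
    join_date="2014/05/01",
)

# asdict() recurses into nested dataclasses, so the nested structure survives.
document = json.dumps(asdict(user), indent=4)
print(document)
```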

Indexing

Indexing a document

In Elasticsearch, storing data is called indexing. Compared with a traditional database:

DB -> databases -> tables -> rows -> columns
Es -> indexes -> types -> documents -> fields

Example: index an employee document

$ curl -XPUT 'http://127.0.0.1:9200/megacorp/employee/1' -H 'Content-type: application/json;charset=utf-8' -d '{"first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}'
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

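In /megacorp/employee/1, megacorp is the index, employee is the type, and 1 is the id. An index that does not exist yet is created automatically with default settings.

Both PUT and POST can be used to create documents:

PUT can create or update, whereas POST (without an id) always creates
PUT is idempotent, POST is not
PUT operates on one specific document, so it needs that document's id; POST does not, because a collision-free id is generated automatically. PUTting to an existing document replaces it and increments its _version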
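The difference is easy to see in a toy model. The sketch below is an in-memory stand-in written for illustration (ToyIndex is not an Elasticsearch API, and uuid4 is just one possible id scheme): repeating a PUT leaves one document behind, repeating a POST leaves two.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for one index; NOT how Elasticsearch stores data."""

    def __init__(self):
        self.docs = {}  # id -> {"_version": int, "_source": dict}

    def put(self, doc_id, source):
        """PUT /index/type/<id>: create, or overwrite and bump _version."""
        existing = self.docs.get(doc_id)
        version = existing["_version"] + 1 if existing else 1
        self.docs[doc_id] = {"_version": version, "_source": source}
        return {"_id": doc_id, "_version": version,
                "result": "updated" if existing else "created"}

    def post(self, source):
        """POST /index/type: always create under a freshly generated id."""
        doc_id = uuid.uuid4().hex  # assumption: any collision-free id scheme would do
        return self.put(doc_id, source)


index = ToyIndex()
first = index.put("1", {"first_name": "John"})
second = index.put("1", {"first_name": "John"})  # same request repeated
a = index.post({"first_name": "Jane"})
b = index.post({"first_name": "Jane"})           # same request repeated
print(first["result"], second["result"], second["_version"])
print(a["_id"] != b["_id"], len(index.docs))
```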

$ curl -XPUT 'http://127.0.0.1:9200/megacorp/employee/2' -H 'Content-type: application/json;charset=utf-8' -d '{"first_name" : "Jane", "last_name" : "Smith", "age" : 32, "about" : "I like to collect rock albums", "interests": [ "music" ]}'
$ curl -XPUT 'http://127.0.0.1:9200/megacorp/employee/3' -H 'Content-type: application/json;charset=utf-8' -d '{"first_name" : "Douglas", "last_name" : "Fir", "age" : 35, "about" : "I like to build cabinets", "interests": [ "forestry" ]}'

Search

Retrieving a document

A GET request that specifies the index, type, and id retrieves the document:

$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/1' 
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":{"first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}}

The response contains some metadata about the document; _source holds the document body.

Fetching a document that does not exist:

$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/4' 
{"_index":"megacorp","_type":"employee","_id":"4","found":false}

Simple search

Request to search for all employees:

$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' 2>/dev/null | jq .
{
  "hits": {
    "hits": [
      {
        "_source": {
          "interests": [
            "music"
          ],
          "about": "I like to collect rock albums",
          "age": 32,
          "last_name": "Smith",
          "first_name": "Jane"
        },
        "_score": 1,
        "_id": "2",
        "_type": "employee",
        "_index": "megacorp"
      },
      {
        "_source": {
          "interests": [
            "sports",
            "music"
          ],
          "about": "I love to go rock climbing",
          "age": 25,
          "last_name": "Smith",
          "first_name": "John"
        },
        "_score": 1,
        "_id": "1",
        "_type": "employee",
        "_index": "megacorp"
      },
      {
        "_source": {
          "interests": [
            "forestry"
          ],
          "about": "I like to build cabinets",
          "age": 35,
          "last_name": "Fir",
          "first_name": "Douglas"
        },
        "_score": 1,
        "_id": "3",
        "_type": "employee",
        "_index": "megacorp"
      }
    ],
    "max_score": 1,
    "total": 3
  },
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "timed_out": false,
  "took": 1
}

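The response tells us how many documents matched and returns their full contents.

To search for employees whose last_name contains Smith, use the _search endpoint and pass the condition in the q parameter: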

/megacorp/employee/_search?q=last_name:Smith

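Query DSL

Query-string search is just a string passed on the URL, which has its limits. Elasticsearch also provides a query DSL that can express more complex and more powerful queries.

The same search for employees whose last_name contains Smith: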

$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{"query" : {"match" : {"last_name" : "Smith"}}}' 2>/dev/null  | jq . 
{
  "hits": {
    "hits": [
      {
        "_source": {
          "interests": [
            "music"
          ],
          "about": "I like to collect rock albums",
          "age": 32,
          "last_name": "Smith",
          "first_name": "Jane"
        },
        "_score": 0.2876821,
        "_id": "2",
        "_type": "employee",
        "_index": "megacorp"
      },
      {
        "_source": {
          "interests": [
            "sports",
            "music"
          ],
          "about": "I love to go rock climbing",
          "age": 25,
          "last_name": "Smith",
          "first_name": "John"
        },
        "_score": 0.2876821,
        "_id": "1",
        "_type": "employee",
        "_index": "megacorp"
      }
    ],
    "max_score": 0.2876821,
    "total": 2
  },
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "timed_out": false,
  "took": 12
}

More complex search

Search for employees whose last_name is Smith and who are older than 30, using a filter:

{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "last_name": "Smith"
        }
      },
      "filter": {
        "range": {
          "age": {
            "gt": 30
          }
        }
      }
    }
  }
}

Run the query (on 6.4.2 this fails, because the filtered query was removed in 5.0):

$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{"query" : {"filtered" : {"filter" : {"range" : {"age" : { "gt" : 30 }}}, "query" : {"match" : {"last_name" : "Smith"}}}}}' 2>/dev/null  | jq . 
{
  "status": 400,
  "error": {
    "col": 26,
    "line": 1,
    "reason": "no [query] registered for [filtered]",
    "type": "parsing_exception",
    "root_cause": [
      {
        "col": 26,
        "line": 1,
        "reason": "no [query] registered for [filtered]",
        "type": "parsing_exception"
      }
    ]
  }
}
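The usual replacement for filtered is a bool query, with the scoring part under must and the non-scoring part under filter. A sketch of that rewrite follows; the helper name filtered_to_bool is made up for illustration, and the printed body can be passed to the same _search endpoint with -d.

```python
import json


def filtered_to_bool(body):
    """Rewrite {"query": {"filtered": {...}}} into the equivalent bool query."""
    filtered = body["query"]["filtered"]
    bool_query = {}
    if "query" in filtered:
        bool_query["must"] = filtered["query"]     # contributes to _score
    if "filter" in filtered:
        bool_query["filter"] = filtered["filter"]  # yes/no only, no scoring
    rewritten = {k: v for k, v in body.items() if k != "query"}
    rewritten["query"] = {"bool": bool_query}
    return rewritten


old_body = {
    "query": {
        "filtered": {
            "filter": {"range": {"age": {"gt": 30}}},
            "query": {"match": {"last_name": "Smith"}},
        }
    }
}
new_body = filtered_to_bool(old_body)
print(json.dumps(new_body))
```

With the three employees indexed earlier, this should match only Jane Smith (age 32).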

Full-text search

Search for all employees who enjoy "rock climbing":

$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{"query" : {"match" : {"about" : "rock climbing"}}}' 2>/dev/null | jq .
{
  "hits": {
    "hits": [
      {
        "_source": {
          "interests": [
            "sports",
            "music"
          ],
          "about": "I love to go rock climbing",
          "age": 25,
          "last_name": "Smith",
          "first_name": "John"
        },
        "_score": 0.5753642,
        "_id": "1",
        "_type": "employee",
        "_index": "megacorp"
      },
      {
        "_source": {
          "interests": [
            "music"
          ],
          "about": "I like to collect rock albums",
          "age": 32,
          "last_name": "Smith",
          "first_name": "Jane"
        },
        "_score": 0.2876821,
        "_id": "2",
        "_type": "employee",
        "_index": "megacorp"
      }
    ],
    "max_score": 0.5753642,
    "total": 2
  },
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 5,
    "total": 5
  },
  "timed_out": false,
  "took": 5
}

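By default Elasticsearch sorts the result set by relevance score, i.e. how well each document matches the query. John Smith ranks first because his about field explicitly says "rock climbing". Jane Smith also shows up because her about field contains rock but not climbing, so her _score is lower than John's.

A traditional database, by contrast, only knows match or no match.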
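The idea of "more matching terms, higher rank" can be illustrated with a toy scorer. This is deliberately not Elasticsearch's real similarity function (which also weighs term frequency, field length, and so on); it only counts how many query terms appear in each about field.

```python
def toy_score(query, text):
    """Count how many distinct query terms occur in the text (toy relevance)."""
    text_terms = set(text.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in text_terms)


abouts = {
    "John Smith": "I love to go rock climbing",
    "Jane Smith": "I like to collect rock albums",
    "Douglas Fir": "I like to build cabinets",
}
query = "rock climbing"
scores = {name: toy_score(query, about) for name, about in abouts.items()}
# Keep only documents that matched at least one term, best match first.
ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])
print(scores)
print(ranked)
```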

Phrase search

To find employee records that contain both "rock" and "climbing", next to each other as the phrase "rock climbing":

{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
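What "adjacent" means can be shown with a toy phrase matcher. It uses naive whitespace tokenization rather than a real analyzer, and contains_phrase is a name invented for this sketch.

```python
def contains_phrase(phrase, text):
    """True if the phrase's terms appear in the text consecutively and in order."""
    terms = phrase.lower().split()
    tokens = text.lower().split()
    return any(tokens[i:i + len(terms)] == terms
               for i in range(len(tokens) - len(terms) + 1))


print(contains_phrase("rock climbing", "I love to go rock climbing"))
print(contains_phrase("rock climbing", "I like to collect rock albums"))
```

Only John's about field passes, which is why the phrase query drops Jane from the results.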

Highlighting search results

{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

Running this query returns the same hits as before, but each hit has a new section called highlight, which contains the text from the about field with the matched words wrapped in <em></em>.
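A rough imitation of what the highlight fragment looks like, using a regular expression instead of Elasticsearch's highlighter (the highlight function here is a hypothetical helper):

```python
import re


def highlight(terms, text):
    """Wrap each whole-word occurrence of the given terms in <em></em>."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"<em>\1</em>", text)


fragment = highlight(["rock", "climbing"], "I love to go rock climbing")
print(fragment)
```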

Aggregations

Analytics

Similar to GROUP BY in SQL, but more powerful.

Example

{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{  "aggs": { "all_interests": { "terms": { "field": "interests" }}}}'

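This does not seem to work as-is on 5.0+, where the field type has to be defined first: a terms aggregation on a text field needs fielddata enabled, or has to target a keyword field such as interests.keyword. The result below is taken from the guide for reference: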

{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            {
               "key":       "music",
               "doc_count": 2
            },
            {
               "key":       "forestry",
               "doc_count": 1
            },
            {
               "key":       "sports",
               "doc_count": 1
            }
         ]
      }
   }
}

So two employees are interested in music.
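The bucket counts can be reproduced locally. The sketch below does in plain Python what the terms aggregation does conceptually, on the three employees indexed earlier; it does not call Elasticsearch.

```python
from collections import Counter

employees = [
    {"last_name": "Smith", "age": 25, "interests": ["sports", "music"]},
    {"last_name": "Smith", "age": 32, "interests": ["music"]},
    {"last_name": "Fir", "age": 35, "interests": ["forestry"]},
]

# One bucket per distinct interest, counting the documents that contain it.
counts = Counter(i for e in employees for i in e["interests"])
buckets = [
    {"key": key, "doc_count": n}
    # Largest bucket first; ties broken alphabetically by key.
    for key, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]
print(buckets)
```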

To find out what people whose last_name is Smith have in common:

{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}

The all_interests aggregation now covers only the documents that match the query.

Response:

"all_interests": {
     "buckets": [
        {
           "key": "music",
           "doc_count": 2
        },
        {
           "key": "sports",
           "doc_count": 1
        }
     ]
  }

Hierarchical rollups

Average age of employees for each interest:

{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

Response:

"all_interests": {
     "buckets": [
        {
           "key": "music",
           "doc_count": 2,
           "avg_age": {
              "value": 28.5
           }
        },
        {
           "key": "forestry",
           "doc_count": 1,
           "avg_age": {
              "value": 35
           }
        },
        {
           "key": "sports",
           "doc_count": 1,
           "avg_age": {
              "value": 25
           }
        }
     ]
  }
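The averages are easy to check by hand. A plain-Python sketch of the same grouping (an avg sub-aggregation inside each terms bucket), again without touching Elasticsearch:

```python
from collections import defaultdict

employees = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]

# Group ages by interest; an employee lands in every bucket they list.
ages_by_interest = defaultdict(list)
for employee in employees:
    for interest in employee["interests"]:
        ages_by_interest[interest].append(employee["age"])

buckets = [
    {"key": key, "doc_count": len(ages), "avg_age": {"value": sum(ages) / len(ages)}}
    for key, ages in sorted(ages_by_interest.items(),
                            key=lambda kv: (-len(kv[1]), kv[0]))
]
for bucket in buckets:
    print(bucket)
```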

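Distributed nature

Elasticsearch can scale out to hundreds or even thousands of servers and handle petabytes of data.

It makes distribution largely transparent, hiding the complexity of a distributed system:

Documents are partitioned into shards, which can live on one node or across several
Shards are spread evenly across the nodes to load-balance indexing and search
Every shard is replicated to prevent data loss on hardware failure
A request sent to any node in the cluster is routed to the node that holds the relevant data
Shards are redistributed seamlessly as nodes are added or removed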
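The routing point can be sketched as follows. The rule shard = hash(routing) % number_of_primary_shards is not stated in this chapter and is brought in here as an assumption. CRC32 is only a stand-in for Elasticsearch's own hash, and the node names and shard layout are invented.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # matches "_shards": {"total": 5} in the responses above

# Hypothetical layout: which node currently holds each primary shard.
SHARD_TO_NODE = {0: "node-1", 1: "node-2", 2: "node-3", 3: "node-1", 4: "node-2"}


def shard_for(routing, number_of_primary_shards=NUMBER_OF_PRIMARY_SHARDS):
    """shard = hash(routing) % number_of_primary_shards, with CRC32 as a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def node_for(doc_id):
    """Any node can run this lookup and forward the request to the owning node."""
    return SHARD_TO_NODE[shard_for(doc_id)]


for doc_id in ("1", "2", "3"):
    print(doc_id, shard_for(doc_id), node_for(doc_id))
```

Because the result depends on the number of primary shards, the same document id always lands on the same shard as long as that number stays fixed.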

Yan, big data engineer at Huoyan Credit (火眼征信)