Elasticsearch: The Definitive Guide, Reading Notes (1): Getting Started
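Getting Started
Elasticsearch is a real-time distributed search and analytics engine.
What it is
Elasticsearch is an open-source search engine built on Apache Lucene(TM). In both the open-source and the proprietary world, Lucene can be considered the most advanced, best-performing, and most fully featured search engine library to date.
Installation
Download and start the service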
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.2.tar.gz
$ tar xf elasticsearch-6.4.2.tar.gz
$ cd elasticsearch-6.4.2
$ ./bin/elasticsearch &
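To start it as a daemon, add the -d flag. To accept connections from other hosts, change network.host in elasticsearch.yml.
Marvel is the management and monitoring tool for Elasticsearch.
For older releases (these notes say 2.3; the -i form shown here comes from the book and matches the 1.x plugin tool) it can be installed with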
$ ./bin/plugin -i elasticsearch/marvel/latest
From 5.0 on, these plugins are bundled into X-Pack:
$ ./bin/elasticsearch-plugin install x-pack
ERROR: this distribution of Elasticsearch contains X-Pack by default
So this distribution already ships with it.
Verify that the service is running
$ curl 'http://localhost:9200/?pretty'
{
"name" : "KAFEEVp",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "7Yq_taMzTDSdy2AX8h-k7g",
"version" : {
"number" : "6.4.2",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "04711c2",
"build_date" : "2018-09-26T13:34:09.098244Z",
"build_snapshot" : false,
"lucene_version" : "7.4.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
Problems you may run into
1. max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
   Raise vm.max_map_count in /etc/sysctl.conf (for example vm.max_map_count=262144) and reload the file with sysctl -p.
2. system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
   CentOS 6 does not support seccomp, while bootstrap.system_call_filter defaults to true, so the bootstrap check fails. Set bootstrap.memory_lock and bootstrap.system_call_filter to false in elasticsearch.yml.
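Interacting with the API
Elasticsearch nodes talk to each other over port 9300 using the Elasticsearch transport protocol. The Java client also connects through port 9300.
Other languages can talk to Elasticsearch over port 9200 through the RESTful HTTP API.
The general form of a curl request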
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
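The template above can be sketched in Python without sending anything. This is only an illustration: build_url and make_request are hypothetical helper names, and the standard library's urllib is an assumption of this sketch, not something the notes use.

```python
import json
from urllib.parse import urlencode, urlunsplit
from urllib.request import Request


def build_url(protocol, host, port, path, query=None):
    """Assemble <PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>."""
    netloc = f"{host}:{port}"
    query_string = urlencode(query or {})
    return urlunsplit((protocol, netloc, path, query_string, ""))


def make_request(verb, url, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"}
    return Request(url, data=data, headers=headers, method=verb)


url = build_url("http", "localhost", 9200,
                "/megacorp/employee/_search", {"pretty": "true"})
request = make_request("GET", url, {"query": {"match_all": {}}})
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without a server.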
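Documents
Objects in an application usually have a complex structure; they may contain dates, geographic locations, and so on.
Elasticsearch is document oriented: it stores whole objects or documents and indexes the contents of each one. You index, search, sort, and filter documents rather than rows of columnar data.
Elasticsearch uses JSON as the serialization format for documents. JSON has become the de facto standard format in the NoSQL world; it is simple, concise, and easy to read.
A JSON representation of an object (that is, a document):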
{
"email": "john@smith.com",
"first_name": "John",
"last_name": "Smith",
"info": {
"bio": "Eco-warrior and defender of the weak",
"age": 25,
"interests": [ "dolphins", "whales" ]
},
"join_date": "2014/05/01"
}
Indexing
Indexing a document
In Elasticsearch, storing data is called indexing. Compared with a traditional relational database:
DB -> databases -> tables -> rows -> columns
Es -> indexes -> types -> documents -> fields
Example: index an employee document
$ curl -XPUT 'http://127.0.0.1:9200/megacorp/employee/1' -H 'Content-type: application/json;charset=utf-8' -d '{"first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}'
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}
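In the path /megacorp/employee/1, megacorp is the index, employee is the type, and 1 is the document ID. If no ID is given, Elasticsearch generates one (this requires POST, as described below).
Both PUT and POST can be used to index a document:
- PUT with an ID creates the document or replaces an existing one, whereas POST without an ID always creates a new document.
- PUT is idempotent; POST is not.
- PUT addresses one specific document, so it needs that document's ID. POST does not: Elasticsearch auto-generates an ID that will not collide with existing ones. PUTting a document that already exists replaces it and increments _version.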
$ curl -XPUT 'http://127.0.0.1:9200/megacorp/employee/2' -H 'Content-type: application/json;charset=utf-8' -d '{"first_name" : "Jane", "last_name" : "Smith", "age" : 32, "about" : "I like to collect rock albums", "interests": [ "music" ]}'
$ curl -XPUT 'http://127.0.0.1:9200/megacorp/employee/3' -H 'Content-type: application/json;charset=utf-8' -d '{"first_name" : "Douglas", "last_name" : "Fir", "age" : 35, "about" : "I like to build cabinets", "interests": [ "forestry" ]}'
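The PUT/POST distinction described above comes down to which verb and path are used. A minimal sketch, where index_request is a hypothetical helper that only builds the request line and sends nothing:

```python
def index_request(index, doc_type, doc_id=None):
    """Return the (verb, path) pair for indexing one document."""
    if doc_id is None:
        # No ID supplied: POST to the type and let Elasticsearch generate one.
        return "POST", f"/{index}/{doc_type}"
    # ID supplied: PUT to the full path, which creates or replaces that document.
    return "PUT", f"/{index}/{doc_type}/{doc_id}"


print(index_request("megacorp", "employee", 1))
print(index_request("megacorp", "employee"))
```

Repeating the PUT leaves a single document with a higher _version, while repeating the POST creates a new document each time, which is why only the former is idempotent.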
Search
Retrieving a document
Given the index, type, and ID, a GET request retrieves the document:
$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/1'
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":{"first_name" : "John", "last_name" : "Smith", "age" : 25, "about" : "I love to go rock climbing", "interests": [ "sports", "music" ]}}
The response contains some metadata about the document; _source holds the document's content.
Fetching a document that does not exist:
$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/4'
{"_index":"megacorp","_type":"employee","_id":"4","found":false}
Simple search
A request that searches for all employees:
$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' 2>/dev/null | jq .
{
"hits": {
"hits": [
{
"_source": {
"interests": [
"music"
],
"about": "I like to collect rock albums",
"age": 32,
"last_name": "Smith",
"first_name": "Jane"
},
"_score": 1,
"_id": "2",
"_type": "employee",
"_index": "megacorp"
},
{
"_source": {
"interests": [
"sports",
"music"
],
"about": "I love to go rock climbing",
"age": 25,
"last_name": "Smith",
"first_name": "John"
},
"_score": 1,
"_id": "1",
"_type": "employee",
"_index": "megacorp"
},
{
"_source": {
"interests": [
"forestry"
],
"about": "I like to build cabinets",
"age": 35,
"last_name": "Fir",
"first_name": "Douglas"
},
"_score": 1,
"_id": "3",
"_type": "employee",
"_index": "megacorp"
}
],
"max_score": 1,
"total": 3
},
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 5,
"total": 5
},
"timed_out": false,
"took": 1
}
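The response tells us how many documents matched and returns the full content of each one.
To search for employees whose last_name contains Smith, use the _search endpoint and pass the condition in the q query-string parameter: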
/megacorp/employee/_search?q=last_name:Smith
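When the value contains spaces or other special characters, the query string has to be URL-encoded. A small sketch using Python's standard library; search_lite_path is a hypothetical name, and keeping the colon unescaped is a readability choice of this sketch rather than a requirement.

```python
from urllib.parse import urlencode


def search_lite_path(index, doc_type, field, value):
    """Build a query-string search path of the form .../_search?q=field:value."""
    # safe=":" keeps the field:value separator readable; spaces become "+".
    query_string = urlencode({"q": f"{field}:{value}"}, safe=":")
    return f"/{index}/{doc_type}/_search?{query_string}"


print(search_lite_path("megacorp", "employee", "last_name", "Smith"))
print(search_lite_path("megacorp", "employee", "about", "rock climbing"))
```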
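Searching with the query DSL
Query-string search passes the search as a plain string, which is convenient but limited. Elasticsearch also provides a query DSL for building more complex and more powerful queries.
Here is the same search, for employees whose last_name contains Smith: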
$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{"query" : {"match" : {"last_name" : "Smith"}}}' 2>/dev/null | jq .
{
"hits": {
"hits": [
{
"_source": {
"interests": [
"music"
],
"about": "I like to collect rock albums",
"age": 32,
"last_name": "Smith",
"first_name": "Jane"
},
"_score": 0.2876821,
"_id": "2",
"_type": "employee",
"_index": "megacorp"
},
{
"_source": {
"interests": [
"sports",
"music"
],
"about": "I love to go rock climbing",
"age": 25,
"last_name": "Smith",
"first_name": "John"
},
"_score": 0.2876821,
"_id": "1",
"_type": "employee",
"_index": "megacorp"
}
],
"max_score": 0.2876821,
"total": 2
},
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 5,
"total": 5
},
"timed_out": false,
"took": 12
}
More complicated searches
To search for employees whose last_name is Smith and who are older than 30, the book uses a filter:
{
"query": {
"filtered": {
"query": {
"match": {
"last_name": "Smith"
}
},
"filter": {
"range": {
"age": {
"gt": 30
}
}
}
}
}
}
Running this query on 6.4.2 fails, because the filtered query was deprecated in 2.0 and removed in 5.0:
$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{"query" : {"filtered" : {"filter" : {"range" : {"age" : { "gt" : 30 }}}, "query" : {"match" : {"last_name" : "Smith"}}}}}' 2>/dev/null | jq .
{
"status": 400,
"error": {
"col": 26,
"line": 1,
"reason": "no [query] registered for [filtered]",
"type": "parsing_exception",
"root_cause": [
{
"col": 26,
"line": 1,
"reason": "no [query] registered for [filtered]",
"type": "parsing_exception"
}
]
}
}
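On 5.0 and later the same intent is usually expressed with a bool query, putting the match clause under must and the range clause under filter. The sketch below only builds the request body; smith_older_than is a hypothetical helper name, and the body has not been run against the 6.4.2 instance used in these notes.

```python
import json


def smith_older_than(min_age):
    """Build a bool query: last_name matches Smith, age greater than min_age."""
    return {
        "query": {
            "bool": {
                # Scored full-text clause.
                "must": {"match": {"last_name": "Smith"}},
                # Non-scoring yes/no clause, replacing the old "filtered" filter.
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }


body = json.dumps(smith_older_than(30))
print(body)
```

The printed JSON can be passed to curl with -d in place of the filtered body above.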
Full-text search
Search for all employees who like "rock climbing":
$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{"query" : {"match" : {"about" : "rock climbing"}}}' 2>/dev/null | jq .
{
"hits": {
"hits": [
{
"_source": {
"interests": [
"sports",
"music"
],
"about": "I love to go rock climbing",
"age": 25,
"last_name": "Smith",
"first_name": "John"
},
"_score": 0.5753642,
"_id": "1",
"_type": "employee",
"_index": "megacorp"
},
{
"_source": {
"interests": [
"music"
],
"about": "I like to collect rock albums",
"age": 32,
"last_name": "Smith",
"first_name": "Jane"
},
"_score": 0.2876821,
"_id": "2",
"_type": "employee",
"_index": "megacorp"
}
],
"max_score": 0.5753642,
"total": 2
},
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 5,
"total": 5
},
"timed_out": false,
"took": 5
}
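By default Elasticsearch sorts the result set by relevance score, that is, by how well each document matches the query. John Smith ranks first because his about field explicitly says "rock climbing". Jane Smith appears because her about field contains rock but not climbing, so her _score is lower than John's.
In a traditional relational database, a record either matches or it does not.
Phrase search
To find employee records that contain both "rock" and "climbing" next to each other, as a phrase, use match_phrase: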
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
}
}
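The difference between match and match_phrase can be illustrated with a toy model over the three about fields indexed earlier. This is a deliberately simplified sketch: the lowercase whitespace tokenizer stands in for Elasticsearch's analyzer, no scoring is done, and the function names are hypothetical.

```python
ABOUT = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}


def tokens(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match_any(query, text):
    """Like match: true if the text shares at least one term with the query."""
    return bool(set(tokens(query)) & set(tokens(text)))


def match_phrase(query, text):
    """Like match_phrase: the query terms must appear adjacent and in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


match_ids = [i for i, text in ABOUT.items() if match_any("rock climbing", text)]
phrase_ids = [i for i, text in ABOUT.items() if match_phrase("rock climbing", text)]
print(match_ids, phrase_ids)
```

In this toy model the match search returns documents 1 and 2, as in the full-text result above, while the phrase search keeps only document 1.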
Highlighting search results
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
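Running this query returns the same hit as before, but the response has a new section called highlight. It contains the text from the about field, with the matched words wrapped in <em></em> tags.
Aggregations
Analytics
Aggregations are similar to GROUP BY in SQL, but more powerful.
Example: find the most popular interests among employees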
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
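$ curl -XGET 'http://127.0.0.1:9200/megacorp/employee/_search' -H 'Content-Type: application/json;charset=utf-8' -d '{ "aggs": { "all_interests": { "terms": { "field": "interests" }}}}'
This does not seem to work as-is since 5.0, where the field type matters: a dynamically mapped string such as interests becomes a text field, and a terms aggregation on a text field needs either fielddata enabled in the mapping or the interests.keyword sub-field. The result from the book, for reference: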
{
...
"hits": { ... },
"aggregations": {
"all_interests": {
"buckets": [
{
"key": "music",
"doc_count": 2
},
{
"key": "forestry",
"doc_count": 1
},
{
"key": "sports",
"doc_count": 1
}
]
}
}
}
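We can see that two employees are interested in music.
To find the popular interests among people whose last_name is Smith, combine a query with the aggregation: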
{
"query": {
"match": {
"last_name": "smith"
}
},
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
The all_interests aggregation now covers only the documents that match the query.
Result:
"all_interests": {
"buckets": [
{
"key": "music",
"doc_count": 2
},
{
"key": "sports",
"doc_count": 1
}
]
}
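To make the bucket counts concrete, the two aggregations above can be reproduced locally on the three employee documents. This is a toy model, not how Elasticsearch computes aggregations: terms_buckets is a hypothetical helper, the exact last_name comparison stands in for the match query, and ties are broken by key only so that the order matches the listings.

```python
from collections import Counter

EMPLOYEES = [
    {"first_name": "John", "last_name": "Smith", "age": 25,
     "interests": ["sports", "music"]},
    {"first_name": "Jane", "last_name": "Smith", "age": 32,
     "interests": ["music"]},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35,
     "interests": ["forestry"]},
]


def terms_buckets(docs, field):
    """Count how many documents contain each distinct value of a field."""
    counts = Counter(value for doc in docs for value in set(doc[field]))
    # Highest doc_count first; ties ordered by key.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


all_buckets = terms_buckets(EMPLOYEES, "interests")
smiths = [doc for doc in EMPLOYEES if doc["last_name"] == "Smith"]
smith_buckets = terms_buckets(smiths, "interests")
print(all_buckets)
print(smith_buckets)
```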
Hierarchical rollups
Compute the average age of the employees who share each interest:
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
Response:
"all_interests": {
"buckets": [
{
"key": "music",
"doc_count": 2,
"avg_age": {
"value": 28.5
}
},
{
"key": "forestry",
"doc_count": 1,
"avg_age": {
"value": 35
}
},
{
"key": "sports",
"doc_count": 1,
"avg_age": {
"value": 25
}
}
]
}
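The nested avg aggregation amounts to grouping ages by interest and averaging each group. A local sketch over the same three documents (avg_age_by_interest is a hypothetical helper; this models the arithmetic only, not Elasticsearch's execution):

```python
from collections import defaultdict

EMPLOYEES = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]


def avg_age_by_interest(docs):
    """Group ages by interest (outer terms bucket), then average (inner avg)."""
    ages = defaultdict(list)
    for doc in docs:
        for interest in set(doc["interests"]):
            ages[interest].append(doc["age"])
    return {interest: sum(values) / len(values)
            for interest, values in ages.items()}


averages = avg_age_by_interest(EMPLOYEES)
print(averages)
```

For music this gives (25 + 32) / 2 = 28.5, which agrees with the avg_age value in the response above.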
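Distributed nature
Elasticsearch can scale out to hundreds or even thousands of servers and handle petabytes of data.
Elasticsearch makes distribution largely transparent and hides the complexity of a distributed system. Under the hood it:
- Partitions documents into different shards, which can live on one node or on several
- Spreads the shards evenly across nodes to balance indexing and search load
- Keeps a redundant copy of each shard to prevent data loss on hardware failure
- Routes a request received by any node in the cluster to the node that holds the relevant data
- Redistributes and migrates shards seamlessly as nodes are added or removed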
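How a document ends up on a particular shard can be sketched with the routing formula that the guide gives in a later chapter: shard = hash(routing) % number_of_primary_shards, where routing defaults to the document ID. The sketch below is an assumption-laden illustration: zlib.crc32 is only a stand-in for Elasticsearch's own hash function, so the shard numbers it prints will not match a real cluster, and shard_for is a hypothetical name.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """Pick a primary shard for a routing value (by default the document ID)."""
    # crc32 is a stand-in hash; Elasticsearch uses a different function.
    digest = zlib.crc32(routing.encode("utf-8"))
    return digest % number_of_primary_shards


# Five primary shards, matching "_shards": {"total": 5} in the search responses.
for doc_id in ("1", "2", "3"):
    print(doc_id, shard_for(doc_id, 5))
```

Because the result depends on the number of primary shards, the same document ID always maps to the same shard only as long as that number stays fixed.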