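Many applications today depend heavily on search. From finding the right product on an e-commerce site, to looking up people on a social network, to locating POIs and addresses on a mapping site, search-dependent applications are everywhere. Amazon's newly launched CloudSearch service offers a viable alternative to implementing search yourself or to installing and customizing popular products such as Apache Lucene, Apache Solr, and elasticsearch. Amazon describes the service as follows: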

Data definition

Although Amazon provides data definitions for both data upload and search responses (in XML and JSON), the documentation defines only a Relax NG schema for data upload and no schema at all for search responses.

In our implementation we decided to use the XML data format rather than JSON, because marshalling XML is simpler: XML uses a well-defined data format, whereas JSON is dynamic (its tags are defined dynamically and vary from request to request). We used the two schemas below (Listing 1 and Listing 2) for uploading data and for search results, respectively.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
      ………………………………………………………………………………………….
      <xsd:complexType name="fieldType">
        <xsd:simpleContent>
          <xsd:extension base="xsd:string">
            <xsd:attribute name="name" type="field_nameType" />
          </xsd:extension>
        </xsd:simpleContent>
      </xsd:complexType>
      <xsd:complexType name="addType">
        <xsd:sequence>
          <xsd:element name="field" type="fieldType" maxOccurs="unbounded" />
        </xsd:sequence>
        <xsd:attribute name="id" type="IDType" />
        <xsd:attribute name="version" type="versionType" />
        <xsd:attribute name="lang" type="xsd:language" />
      </xsd:complexType>
      <xsd:complexType name="deleteType">
        <xsd:attribute name="id" type="IDType" />
        <xsd:attribute name="version" type="versionType" />
      </xsd:complexType>
      <xsd:complexType name="batchType">
        <xsd:sequence>
          <xsd:element name="add" type="addType" minOccurs="0" maxOccurs="unbounded" />
          <xsd:element name="delete" type="deleteType" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
      </xsd:complexType>
      <xsd:element name="batch" type="batchType" />
      <xsd:simpleType name="statusType">
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="success"/>
          <xsd:enumeration value="error" />
        </xsd:restriction>
      </xsd:simpleType>
      <xsd:complexType name="errorsType">
        <xsd:sequence>
          <xsd:element name="error" type="xsd:string" maxOccurs="unbounded" />
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="warningsType">
        <xsd:sequence>
          <xsd:element name="warning" type="xsd:string" maxOccurs="unbounded" />
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="responseType">
        <xsd:sequence>
          <xsd:element name="errors" type="errorsType" minOccurs="0" />
          <xsd:element name="warnings" type="warningsType" minOccurs="0" />
        </xsd:sequence>
        <xsd:attribute name="status" type="statusType"/>
        <xsd:attribute name="adds" type="xsd:int"/>
        <xsd:attribute name="deletes" type="xsd:int"/>
      </xsd:complexType>
      <xsd:element name="response" type="responseType" />
    </xsd:schema>

Listing 1 Upload data schema

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://cloudsearch.amazonaws.com/2011-02-01/results"
        xmlns="http://cloudsearch.amazonaws.com/2011-02-01/results"
        elementFormDefault="qualified">
      <xsd:complexType name="constraintType">
        <xsd:attribute name="value" type="xsd:string"/>
        <xsd:attribute name="count" type="xsd:int"/>
      </xsd:complexType>
      <xsd:complexType name="facetType">
        <xsd:sequence>
          <xsd:element name="constraint" type="constraintType" maxOccurs="unbounded"/>
        </xsd:sequence>
        <xsd:attribute name="name" type="xsd:string" />
      </xsd:complexType>
      <xsd:complexType name="facetsType">
        <xsd:sequence>
          <xsd:element name="facet" type="facetType" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="infoType">
        <xsd:attribute name="rid" type="xsd:string" />
        <xsd:attribute name="time-ms" type="xsd:int" />
        <xsd:attribute name="cpu-time-ms" type="xsd:int" />
      </xsd:complexType>
      <xsd:complexType name="dType">
        <xsd:simpleContent>
          <xsd:extension base="xsd:string">
            <xsd:attribute name="name" type="xsd:string" />
          </xsd:extension>
        </xsd:simpleContent>
      </xsd:complexType>
      <xsd:complexType name="hitType">
        <xsd:sequence>
          <xsd:element name="d" type="dType" maxOccurs="unbounded"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:string" />
      </xsd:complexType>
      <xsd:complexType name="hitsType">
        <xsd:sequence>
          <xsd:element name="hit" type="hitType" maxOccurs="unbounded"/>
        </xsd:sequence>
        <xsd:attribute name="found" type="xsd:int" />
        <xsd:attribute name="start" type="xsd:int" />
      </xsd:complexType>
      <xsd:complexType name="resultsType">
        <xsd:sequence>
          <xsd:element name="rank" type="xsd:string" />
          <xsd:element name="match-expr" type="xsd:string" />
          <xsd:element name="hits" type="hitsType" minOccurs="0"/>
          <xsd:element name="facets" type="facetsType" minOccurs="0"/>
          <xsd:element name="info" type="infoType" />
        </xsd:sequence>
      </xsd:complexType>
      <xsd:element name="results" type="resultsType"/>
      <xsd:complexType name="messageType">
        <xsd:attribute name="severity" type="xsd:string" />
        <xsd:attribute name="code" type="xsd:string" />
        <xsd:attribute name="message" type="xsd:string"/>
      </xsd:complexType>
      <xsd:complexType name="errorType">
        <xsd:sequence>
          <xsd:element name="error" type="xsd:string" />
          <xsd:element name="rid" type="xsd:string" />
          <xsd:element name="time-ms" type="xsd:int" />
          <xsd:element name="cpu-time-ms" type="xsd:int" />
          <xsd:element name="messages" type="messageType" maxOccurs="unbounded" />
        </xsd:sequence>
      </xsd:complexType>
      <xsd:element name="error" type="errorType" />
    </xsd:schema>

Listing 2 Search results data schema

We used the xjc binding compiler to generate Java classes from the two schemas above, so that marshalling and unmarshalling can be done automatically through the Java Architecture for XML Binding (JAXB).

Query definition

Besides the data definitions, implementing the search API also requires a query definition. We created a set of classes that implement Amazon's query definition.

At the core of a search query is the filter. We introduced a SearchQueryFilter interface and provided two implementations of it: SearchQueryValueFilter (Listing 3) and SearchQueryFilterOperation (Listing 4).

    public class SearchQueryValueFilter implements SearchQueryFilter {

        private String _field;
        private String _value;
        private boolean _isExclude;
        private boolean _isNumeric;

        public SearchQueryValueFilter() {}

        public SearchQueryValueFilter(String field, String value, boolean isNumeric, boolean isExclude) {
            _field = field;
            _value = value;
            _isExclude = isExclude;
            _isNumeric = isNumeric;
        }

        public String getField() { return _field; }
        public void setField(String field) { _field = field; }
        public String getValue() { return _value; }
        public void setValue(String value) { _value = value; }
        public boolean isExclude() { return _isExclude; }
        public void setExclude(boolean isExclude) { _isExclude = isExclude; }
        public boolean isNumeric() { return _isNumeric; }
        public void setNumeric(boolean isNumeric) { _isNumeric = isNumeric; }

        // Renders the filter as field:'value' (or field:value for numeric values),
        // wrapped in (not ...) when the filter is an exclusion.
        @Override
        public String toString() {
            StringBuffer sb = new StringBuffer();
            if (_isExclude) {
                sb.append("(not ");
            }
            if (_field != null) {
                sb.append(_field);
                sb.append(":");
            }
            if (!_isNumeric) {
                sb.append("'");
            }
            sb.append(_value);
            if (!_isNumeric) {
                sb.append("'");
            }
            if (_isExclude) {
                sb.append(")");
            }
            return sb.toString();
        }
    }

Listing 3 Value filter implementation

    import java.util.LinkedList;
    import java.util.List;

    public class SearchQueryFilterOperation implements SearchQueryFilter {

        List<SearchQueryFilter> _filters;
        FilterOperation _operation;

        // Defaults to combining the nested filters with "and".
        public SearchQueryFilterOperation() {
            _operation = FilterOperation.and;
            _filters = new LinkedList<SearchQueryFilter>();
        }

        public List<SearchQueryFilter> getFilters() { return _filters; }
        public void setFilters(List<SearchQueryFilter> filters) { _filters = filters; }
        public void addFilters(SearchQueryFilter filter) { _filters.add(filter); }
        public FilterOperation getOperation() { return _operation; }
        public void setOperation(FilterOperation operation) { _operation = operation; }

        // Renders the operation in prefix form: (and f1 f2 ...) or (or f1 f2 ...).
        @Override
        public String toString() {
            StringBuffer sb = new StringBuffer();
            sb.append("(");
            sb.append(_operation);
            for (SearchQueryFilter f : _filters) {
                sb.append(" ");
                sb.append(f);
            }
            sb.append(")");
            return sb.toString();
        }

        public enum FilterOperation {
            and, or
        }
    }

Listing 4 Operation filter implementation

The SearchQueryValueFilter class lets developers constrain a single field using equality, less-than, greater-than, and range comparisons (negated comparisons are supported as well). The SearchQueryFilterOperation class additionally lets developers combine multiple SearchQueryValueFilter and SearchQueryFilterOperation instances with the AND/OR operators. Together, these two classes can express any query filter expression that Amazon CloudSearch supports.

Amazon CloudSearch supports faceted classification:
"Faceted classification is used in faceted search systems, in which users can navigate information along multiple dimensions (translator's note: a book, for example, can be browsed by author, subject, or publication date); these dimensions correspond to facets that can be applied in different orders."

AWS supports controlling search execution by facet and sorting the results. It also lets developers control the number of facets included in the returned search results. All facet operations are implemented by the SearchQueryFacet class (Listing 5).

    import java.util.LinkedList;
    import java.util.List;

    public class SearchQueryFacet {

        private String _name;
        private int _maxFacets;            // -1 means "no limit specified"
        private List<String> _constraints;
        private FacetSort _sort;

        public SearchQueryFacet(String name) {
            _name = name;
            _maxFacets = -1;
            _constraints = null;
            _sort = FacetSort.none;
        }

        public SearchQueryFacet(String name, int maxFacets) {
            _name = name;
            _maxFacets = maxFacets;
            _constraints = null;
            _sort = FacetSort.none;
        }

        public SearchQueryFacet(String name, int maxFacets, FacetSort sort) {
            _name = name;
            _maxFacets = maxFacets;
            _constraints = null;
            _sort = sort;
        }

        public SearchQueryFacet(String name, int maxFacets, List<String> constraints) {
            _name = name;
            _maxFacets = maxFacets;
            _constraints = constraints;
            _sort = FacetSort.none;
        }

        public SearchQueryFacet(String name, List<String> constraints) {
            _name = name;
            _maxFacets = -1;
            _constraints = constraints;
            _sort = FacetSort.none;
        }

        public SearchQueryFacet(String name, FacetSort sort, List<String> constraints) {
            _name = name;
            _maxFacets = -1;
            _constraints = constraints;
            _sort = sort;
        }

        public SearchQueryFacet(String name, FacetSort sort) {
            _name = name;
            _maxFacets = -1;
            _constraints = null;
            _sort = sort;
        }

        public String getName() { return _name; }
        public void setName(String name) { _name = name; }
        public int getMaxFacets() { return _maxFacets; }
        public void setMaxFacets(int maxFacets) { _maxFacets = maxFacets; }
        public FacetSort getSort() { return _sort; }
        public void setSort(FacetSort sort) { _sort = sort; }
        public int get_maxFacets() { return _maxFacets; }
        public void set_maxFacets(int _maxFacets) { this._maxFacets = _maxFacets; }
        public List<String> getConstraints() { return _constraints; }
        public void setConstraints(List<String> constraints) { _constraints = constraints; }

        public void addConstraint(String constraint) {
            if (_constraints == null)
                _constraints = new LinkedList<String>();
            _constraints.add(constraint);
        }

        // Renders the facet as HTTP query parameters:
        // &facet=<name>[&facet-<name>-top-n=N][&facet-<name>-constraints=...][&facet-<name>-sort=...]
        @Override
        public String toString() {
            StringBuffer sb = new StringBuffer();
            sb.append("&facet=");
            sb.append(_name);
            if (_maxFacets > 0) {
                sb.append("&facet-");
                sb.append(_name);
                sb.append("-top-n=");
                sb.append(_maxFacets);
            }
            if ((_constraints != null) && (_constraints.size() > 0)) {
                sb.append("&facet-");
                sb.append(_name);
                sb.append("-constraints=");
                boolean first = true;
                for (String c : _constraints) {
                    if (!first)
                        sb.append("%2C");   // URL-encoded comma
                    else
                        first = false;
                    sb.append("%27");       // URL-encoded single quote
                    sb.append(c);
                    sb.append("%27");
                }
            }
            if (!_sort.equals(FacetSort.none)) {
                sb.append("&facet-");
                sb.append(_name);
                sb.append("-sort=");
                sb.append(_sort);
            }
            return sb.toString();
        }

        public enum FacetSort {
            none, alpha, count, max, sum
        }
    }

Listing 5 Facets control class

Finally, the SearchQuerySort class (Listing 6) gives developers control over how the results are sorted.

    import java.util.LinkedList;
    import java.util.List;

    public class SearchQuerySort {

        private List<SearchRank> _ranks;

        public SearchQuerySort() {
            _ranks = new LinkedList<SearchRank>();
        }

        public void addRank(SearchRank rank) {
            _ranks.add(rank);
        }

        @Override
        public String toString() {
            // Return an empty string rather than null, so that appending an
            // empty sort to the query does not add the literal text "null".
            if (_ranks.size() == 0)
                return "";
            StringBuffer sb = new StringBuffer();
            sb.append("&rank=");
            boolean first = true;
            for (SearchRank r : _ranks) {
                if (!first)
                    sb.append("%2C");   // URL-encoded comma
                else
                    first = false;
                sb.append(r);
            }
            return sb.toString();
        }

        public static class SearchRank {

            private String _name;
            private boolean _ascending;

            public SearchRank() {
                _ascending = true;
            }

            public SearchRank(String name) {
                _ascending = true;
                _name = name;
            }

            public SearchRank(String name, boolean ascending) {
                _ascending = ascending;
                _name = name;
            }

            // A leading "-" marks a descending rank.
            @Override
            public String toString() {
                if (_ascending)
                    return _name;
                return "-" + _name;
            }
        }
    }

Listing 6 Sort control class

The CloudSearch query class brings all of these parameters together and adds paging information and a list of fields to return. The class also provides a method that converts the query to HTTP form (Listing 7): it assembles all parts of the search query and produces an HTTP query string that the search service can process.

    // Requires java.net.URLEncoder.
    public String toHttpQuery() throws Exception {
        StringBuffer sb = new StringBuffer();
        sb.append("?results-type=xml");
        if (_size > 0) {
            sb.append("&size=");
            sb.append(_size);
        }
        if (_start > 0) {
            sb.append("&start=");
            sb.append(_start);
        }
        if ((_fields != null) && (_fields.size() > 0)) {
            sb.append("&return-fields=");
            boolean first = true;
            for (String f : _fields) {
                if (!first)
                    sb.append("%2C");
                else
                    first = false;
                sb.append(f);
            }
        }
        if (_filter != null) {
            // A single value filter is sent as q; a composite filter is sent as bq.
            if (_filter instanceof SearchQueryValueFilter)
                sb.append("&q=");
            else
                sb.append("&bq=");
            sb.append(URLEncoder.encode(_filter.toString(), "UTF-8"));
        }
        if ((_facets != null) && (_facets.size() > 0)) {
            for (SearchQueryFacet f : _facets) {
                sb.append(f);
            }
        }
        if ((_sorts != null) && (_sorts.size() > 0)) {
            for (SearchQuerySort s : _sorts) {
                sb.append(s);
            }
        }
        return sb.toString();
    }

Listing 7 Convert to HTTP query method

We used Apache HttpComponents to implement the communication with Amazon CloudSearch.

Testing our API

We validated our API against the IMDB sample provided by Amazon. The first unit test (Listing 8) verifies our search API implementation.

    import junit.framework.TestCase;

    public class SearchAPITester extends TestCase {

        private static final String SearchURL =
            "search-imdb-movies-ab4fpqw4eocczpgsnrtlu4rn7i.us-east-1.cloudsearch.amazonaws.com";
        private CloudSearchClient client;

        protected void setUp() throws Exception {
            client = new CloudSearchClient(SearchURL);
        }

        protected void tearDown() {
            client.close();
        }

        public void testSearch() throws Exception {
            // (title:'star' OR NOT title:'war') AND year up to 2000
            SearchQueryValueFilter f1 = new SearchQueryValueFilter("title", "star", false, false);
            SearchQueryValueFilter f11 = new SearchQueryValueFilter("title", "war", false, true);
            SearchQueryValueFilter f2 = new SearchQueryValueFilter("year", "..2000", true, false);
            SearchQueryFilterOperation f12 = new SearchQueryFilterOperation();
            f12.setOperation(FilterOperation.or);
            f12.addFilters(f1);
            f12.addFilters(f11);
            SearchQueryFilterOperation f3 = new SearchQueryFilterOperation();
            f3.addFilters(f12);
            f3.addFilters(f2);

            CloudSearchQuery query = new CloudSearchQuery(f3);
            query.addField("actor");
            query.addField("director");
            query.addField("title");
            query.addField("year");

            SearchQueryFacet sf = new SearchQueryFacet("genre", 5, FacetSort.alpha);
            sf.addConstraint("Drama");
            sf.addConstraint("Sci-Fi");
            query.addFacet(sf);

            // Sort by title ascending, then by year descending.
            SearchQuerySort sort = new SearchQuerySort();
            SearchRank r1 = new SearchRank("title");
            SearchRank r2 = new SearchRank("year", false);
            sort.addRank(r1);
            sort.addRank(r2);
            query.addSort(sort);

            try {
                System.out.println("Test 1 ");
                SearchResults result = client.search(query);
                System.out.println(result);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Listing 8 Search API tester

The result of this test (Listing 9) is the same as the result obtained by calling the Amazon REST API directly.

    SearchResults ]

The second test (Listing 10) verifies adding and deleting documents.

    import junit.framework.TestCase;

    public class DocumentAPITester extends TestCase {

        private static final String DocumentURL =
            "doc-imdb-movies-ab4fpqw4eocczpgsnrtlu4rn7i.us-east-1.cloudsearch.amazonaws.com";
        private CloudSearchDocumentClient client;
        private BatchType batch;

        protected void setUp() throws Exception {
            client = new CloudSearchDocumentClient(DocumentURL);

            FieldType title = new FieldType();
            title.setName("title");
            title.setValue("The Seeker: The Dark Is Rising");
            FieldType director = new FieldType();
            director.setName("director");
            director.setValue("Cunningham, David L.");
            FieldType genrea = new FieldType();
            genrea.setName("genre");
            genrea.setValue("Adventure");
            FieldType genred = new FieldType();
            genred.setName("genre");
            genred.setValue("Drama");
            FieldType genref = new FieldType();
            genref.setName("genre");
            genref.setValue("Fantasy");
            FieldType genret = new FieldType();
            genret.setName("genre");
            genret.setValue("Thriller");
            FieldType actor1 = new FieldType();
            actor1.setName("actor");
            actor1.setValue("McShane, Ian");
            FieldType actor2 = new FieldType();
            actor2.setName("actor");
            actor2.setValue("Eccleston, Christopher");
            FieldType actor3 = new FieldType();
            actor3.setName("actor");
            actor3.setValue("Conroy, Frances");
            FieldType actor4 = new FieldType();
            actor4.setName("actor");
            actor4.setValue("Conroy, Frances");
            FieldType actor5 = new FieldType();
            actor5.setName("actor");
            actor5.setValue("Ludwig, Alexander");
            FieldType actor6 = new FieldType();
            actor6.setName("actor");
            actor6.setValue("Crewson, Wendy");
            FieldType actor7 = new FieldType();
            actor7.setName("actor");
            actor7.setValue("Warner, Amelia");
            FieldType actor8 = new FieldType();
            actor8.setName("actor");
            actor8.setValue("Cosmo, James");
            FieldType actor9 = new FieldType();
            actor9.setName("actor");
            actor9.setValue("Hickey, John Benjamin");
            FieldType actor10 = new FieldType();
            actor10.setName("actor");
            actor10.setValue("Piddock, Jim");
            FieldType actor11 = new FieldType();
            actor11.setName("actor");
            actor11.setValue("Lockhart, Emma");

            // Document to add.
            AddType add = new AddType();
            add.setId("tt0484562");
            add.setVersion(1L);
            add.setLang("en");
            add.getField().add(title);
            add.getField().add(director);
            add.getField().add(genrea);
            add.getField().add(genred);
            add.getField().add(genref);
            add.getField().add(genret);
            add.getField().add(actor1);
            add.getField().add(actor2);
            add.getField().add(actor3);
            add.getField().add(actor4);
            add.getField().add(actor5);
            add.getField().add(actor6);
            add.getField().add(actor7);
            add.getField().add(actor8);
            add.getField().add(actor9);
            add.getField().add(actor10);
            add.getField().add(actor11);

            // Document to delete.
            DeleteType delete = new DeleteType();
            delete.setId("tt0301199");
            delete.setVersion(1L);

            batch = new BatchType();
            batch.getAdd().add(add);
            batch.getDelete().add(delete);
        }

        protected void tearDown() {
            client.close();
        }

        public void testSearch() throws Exception {
            try {
                System.out.println("Test 1 ");
                ResponseType result = client.index(batch);
                System.out.println("Status " + result.getStatus() + " Added " + result.getAdds()
                    + " Deleted " + result.getDeletes());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Listing 10 Document upload tester

This test also produced the expected result (Listing 11).

    Status SUCCESS Added 1 Deleted 1

Listing 11 Document upload test results

Conclusion

The simple Java API described above implements the functionality of Amazon CloudSearch and significantly simplifies using it from existing Java applications, which should broaden its adoption.

About the author

Dr. Boris Lublinsky is a principal architect at Nokia, where he works mainly on big data, SOA, BPM, and middleware implementations. Before that, Boris was a principal architect at Herzum Software, designing large-scale SOA systems for clients, and was responsible for enterprise architecture at CNA Insurance, where he took part in designing and implementing CNA's integration and SOA strategy, built application frameworks, and implemented service-oriented architectures. Boris has more than 25 years of experience in enterprise and technical architecture and in software engineering. He is an active member of the OASIS SOA Reference Model Technical Committee and a co-author of the book Applied SOA: Service-Oriented Architecture and Design Strategies. He has also published numerous articles on architecture, programming, big data, SOA, and BPM.

Original English article: Using AWS Cloud Search
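
As a closing illustration, the filter syntax discussed under "Query definition" can be reproduced in a few self-contained lines. This is only a minimal sketch: the class name FilterSketch and its helper methods value and combine are hypothetical and are not part of the API described in this article. They simply mirror the string format that the toString methods in Listing 3 and Listing 4 appear to produce, using the same filter as the test in Listing 8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class FilterSketch {

    // Renders one field constraint: field:'value', or field:value for numeric
    // values, wrapped in (not ...) when the constraint is an exclusion.
    static String value(String field, String value, boolean numeric, boolean exclude) {
        String rendered = numeric ? value : "'" + value + "'";
        String term = field + ":" + rendered;
        return exclude ? "(not " + term + ")" : term;
    }

    // Combines sub-expressions in prefix form, e.g. (and a b).
    static String combine(String operation, List<String> parts) {
        StringBuilder sb = new StringBuilder("(").append(operation);
        for (String part : parts) {
            sb.append(' ').append(part);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String titles = combine("or", Arrays.asList(
                value("title", "star", false, false),
                value("title", "war", false, true)));
        String filter = combine("and", Arrays.asList(
                titles,
                value("year", "..2000", true, false)));

        // Prints: (and (or title:'star' (not title:'war')) year:..2000)
        System.out.println(filter);

        // A composite filter is URL-encoded and sent as the bq parameter.
        System.out.println("&bq=" + URLEncoder.encode(filter, "UTF-8"));
    }
}
```

Seeing the unencoded expression next to its encoded form can make it easier to compare the generated query with one typed by hand against the REST API.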