NEO4J
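Note: to use the Neo4j algorithm library, install the algorithm library plugin that matches your Neo4j database version, or a custom-built algorithm library.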
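1. Introduction
The overlap similarity algorithm first represents the two vectors as one-dimensional coordinate lists of equal length, that is, it maps them into a one-dimensional space, and then computes a weighted sum of their overlap. It considers neither the angle between the two vectors nor the length of their difference; in practice it measures how much of the smaller input is contained in the larger one.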
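In set form, the formula is:
$$O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
In the per-dimension form, the denominator acts as a normalization factor, the sum runs only over the dimensions the two vectors have in common, and $O(x_{i}, y_{i})$ is the overlap function for dimension $i$.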
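The set form of the measure can be sketched in a few lines of Java. This is a minimal illustration, not the library's code: the class name `OverlapCoefficient` and the method `overlap` are made up for this example, and the inputs are assumed to be sets of integers.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, written only to illustrate the formula.
class OverlapCoefficient {

    /** Returns |A ∩ B| / min(|A|, |B|), or 0 when either set is empty. */
    static double overlap(Set<Integer> a, Set<Integer> b) {
        int smaller = Math.min(a.size(), b.size());
        if (smaller == 0) {
            return 0.0;
        }
        // Copy a so that retainAll does not modify the caller's set.
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / smaller;
    }

    public static void main(String[] args) {
        // Two shared elements, smaller set has three: prints 0.6666666666666666
        System.out.println(overlap(Set.of(1, 2, 3), Set.of(1, 2, 4, 5)));
    }
}
```

Because the denominator is the size of the smaller set, the score reaches 1.0 whenever the smaller set is fully contained in the larger one, regardless of how large the other set is.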
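2. Use cases
The overlap similarity algorithm can be used to work out which things are subsets of other things. The two things being compared do not need to have the same number of associated items. The computed subsets can then be used to learn a taxonomy from tagged data, for example in common data-mining analysis.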
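To make the subset idea concrete, here is a small Java sketch. The class `SubsetByOverlap` and the method `isNarrowerThan` are hypothetical names chosen for this example; the book titles are borrowed from the sample data used later in the article.

```java
import java.util.Set;

// Hypothetical helper, only meant to illustrate subset detection.
class SubsetByOverlap {

    /** Overlap coefficient of two sets of strings. */
    static double overlap(Set<String> a, Set<String> b) {
        int smaller = Math.min(a.size(), b.size());
        if (smaller == 0) {
            return 0.0;
        }
        long common = a.stream().filter(b::contains).count();
        return (double) common / smaller;
    }

    /** True when a is non-empty, no larger than b, and fully contained in b. */
    static boolean isNarrowerThan(Set<String> a, Set<String> b) {
        return !a.isEmpty() && a.size() <= b.size() && overlap(a, b) == 1.0;
    }

    public static void main(String[] args) {
        Set<String> dystopia = Set.of("Fahrenheit 451", "1984");
        Set<String> scienceFiction =
                Set.of("Fahrenheit 451", "The Hunger Games", "1984", "Dune");
        System.out.println(isNarrowerThan(dystopia, scienceFiction)); // true
        System.out.println(isNarrowerThan(scienceFiction, dystopia)); // false
    }
}
```

A score of 1.0 alone does not say which side is the subset, so the sketch also compares sizes. Under that reading, Dystopia would sit below Science Fiction in a learned taxonomy.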
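3. Using the overlap functions in Neo4j
Neo4j provides the following function and procedures. As the source code analysis in section 4 shows, the function suits comparing the overlap of two items, while the procedures suit comparing many items at once.
- algo.similarity.overlap function; takes two List<Number> arguments
- algo.similarity.overlap.stream procedure; takes (List<Map<String,Object>> data, Map<String, Object> config)
- algo.similarity.overlap procedure; takes (List<Map<String,Object>> data, Map<String, Object> config)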
1. Compute the overlap of two hard-coded lists
RETURN algo.similarity.overlap([1,2,3], [1,2,4,5]) AS similarity
Result: 0.6666666666666666
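The value can be checked by hand. The derivation below is an added worked example; it uses the set form of the overlap coefficient, which matches what the function's source code in section 4 computes.

```latex
\begin{align*}
A &= \{1, 2, 3\}, \qquad B = \{1, 2, 4, 5\} \\
A \cap B &= \{1, 2\} \\
O(A, B) &= \frac{|A \cap B|}{\min(|A|, |B|)}
         = \frac{2}{\min(3, 4)}
         = \frac{2}{3} \approx 0.6667
\end{align*}
```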
2. Initialize the node data
MERGE (fahrenheit451:Book {title:'Fahrenheit 451'})
MERGE (dune:Book {title:'Dune'})
MERGE (hungerGames:Book {title:'The Hunger Games'})
MERGE (nineteen84:Book {title:'1984'})
MERGE (gatsby:Book {title:'The Great Gatsby'})

MERGE (scienceFiction:Genre {name: "Science Fiction"})
MERGE (fantasy:Genre {name: "Fantasy"})
MERGE (dystopia:Genre {name: "Dystopia"})
MERGE (classics:Genre {name: "Classics"})

MERGE (fahrenheit451)-[:HAS_GENRE]->(dystopia)
MERGE (fahrenheit451)-[:HAS_GENRE]->(scienceFiction)
MERGE (fahrenheit451)-[:HAS_GENRE]->(fantasy)
MERGE (fahrenheit451)-[:HAS_GENRE]->(classics)

MERGE (hungerGames)-[:HAS_GENRE]->(scienceFiction)
MERGE (hungerGames)-[:HAS_GENRE]->(fantasy)
MERGE (hungerGames)-[:HAS_GENRE]->(romance)

MERGE (nineteen84)-[:HAS_GENRE]->(scienceFiction)
MERGE (nineteen84)-[:HAS_GENRE]->(dystopia)
MERGE (nineteen84)-[:HAS_GENRE]->(classics)

MERGE (dune)-[:HAS_GENRE]->(scienceFiction)
MERGE (dune)-[:HAS_GENRE]->(fantasy)
MERGE (dune)-[:HAS_GENRE]->(classics)

MERGE (gatsby)-[:HAS_GENRE]->(classics)
3. Compute the intersection and overlap similarity between nodes
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to,count1, count2, intersection, similarity
ORDER BY similarity DESC
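Result: count1 is the number of nodes related to from, count2 is the number of nodes related to to, intersection is the number of nodes they share, and similarity is the overlap similarity.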
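The same four columns can be reproduced outside the database with a short Java sketch. Everything here is illustrative: the class `GenreOverlapPairs` is a made-up name, the genre-to-book sets are copied by hand from the initialization step, and the relationship to the undeclared `romance` variable is left out on the assumption that it adds no named genre. The pair order and direction may differ from what the procedure returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: pairwise overlap between genres, based on shared books.
class GenreOverlapPairs {

    // Genre name -> titles of the books that have that genre (hand-copied sample data).
    static final Map<String, Set<String>> GENRES = new LinkedHashMap<>();
    static {
        GENRES.put("Science Fiction",
                Set.of("Fahrenheit 451", "The Hunger Games", "1984", "Dune"));
        GENRES.put("Fantasy", Set.of("Fahrenheit 451", "The Hunger Games", "Dune"));
        GENRES.put("Dystopia", Set.of("Fahrenheit 451", "1984"));
        GENRES.put("Classics",
                Set.of("Fahrenheit 451", "1984", "Dune", "The Great Gatsby"));
    }

    /** Number of books the two genres have in common. */
    static long intersection(String g1, String g2) {
        Set<String> other = GENRES.get(g2);
        return GENRES.get(g1).stream().filter(other::contains).count();
    }

    /** Overlap similarity: shared books divided by the size of the smaller genre. */
    static double similarity(String g1, String g2) {
        int smaller = Math.min(GENRES.get(g1).size(), GENRES.get(g2).size());
        return smaller == 0 ? 0.0 : (double) intersection(g1, g2) / smaller;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(GENRES.keySet());
        // Visit every unordered pair exactly once.
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                String a = names.get(i);
                String b = names.get(j);
                System.out.printf("%s / %s: count1=%d count2=%d intersection=%d similarity=%.3f%n",
                        a, b, GENRES.get(a).size(), GENRES.get(b).size(),
                        intersection(a, b), similarity(a, b));
            }
        }
    }
}
```

With this data, Fantasy and Dystopia are each fully contained in Science Fiction, so both pairs score 1.0, while Classics and Science Fiction share three of four books and score 0.75.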
4. Filter the similarity results, keeping only pairs with a similarity of at least 0.75
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap.stream(data, {similarityCutoff: 0.75})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to,count1, count2, intersection, similarity
ORDER BY similarity DESC
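The effect of the cutoff can be pictured with a small Java sketch. It is not the library's implementation: the class `SimilarityCutoffSketch` and its methods are hypothetical, and the six scores are worked out by hand from the sample genre data (leaving out the undeclared `romance` relationship), not copied from Neo4j output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a similarity cutoff applied to precomputed scores.
class SimilarityCutoffSketch {

    /** Keeps only the pairs whose score is greater than or equal to the cutoff. */
    static Map<String, Double> atLeast(Map<String, Double> scores, double cutoff) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    /** Hand-computed overlap scores for the sample genres (an assumption of this sketch). */
    static Map<String, Double> sampleScores() {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Fantasy -> Science Fiction", 1.0);
        scores.put("Dystopia -> Science Fiction", 1.0);
        scores.put("Dystopia -> Classics", 1.0);
        scores.put("Classics -> Science Fiction", 0.75);
        scores.put("Fantasy -> Classics", 2.0 / 3.0);
        scores.put("Dystopia -> Fantasy", 0.5);
        return scores;
    }

    public static void main(String[] args) {
        // Prints the four pairs that score 0.75 or higher.
        atLeast(sampleScores(), 0.75).keySet().forEach(System.out::println);
    }
}
```

Under these assumptions, a cutoff of 0.75 drops the Fantasy/Classics and Dystopia/Fantasy pairs and keeps the other four.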
5. Find the most similar nodes for each node and store the relationships between them
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), categories: collect(id(book))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap(data, {topK: 2, similarityCutoff: 0.5, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
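The combination of topK and similarityCutoff can be sketched generically in Java: drop pairs below the cutoff, then keep at most K of the best-scoring pairs per source item. This only illustrates the general idea. The class `TopKPerItem`, the nested `Pair` type, the direction of each pair (smaller set first), and the hand-computed scores are all assumptions of this example, and the library's own tie-breaking and ordering rules are not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "top K per item" selection over scored pairs.
class TopKPerItem {

    /** One scored, directed pair: from -> to. */
    static final class Pair {
        final String from;
        final String to;
        final double score;

        Pair(String from, String to, double score) {
            this.from = from;
            this.to = to;
            this.score = score;
        }
    }

    /** Keeps, for each source item, at most k pairs with score >= cutoff, best first. */
    static List<Pair> topK(List<Pair> pairs, int k, double cutoff) {
        Map<String, List<Pair>> bySource = new LinkedHashMap<>();
        for (Pair p : pairs) {
            if (p.score >= cutoff) {
                bySource.computeIfAbsent(p.from, key -> new ArrayList<>()).add(p);
            }
        }
        List<Pair> kept = new ArrayList<>();
        for (List<Pair> group : bySource.values()) {
            // Highest score first; the sort is stable, so ties keep input order.
            group.sort(Comparator.comparingDouble((Pair p) -> p.score).reversed());
            kept.addAll(group.subList(0, Math.min(k, group.size())));
        }
        return kept;
    }

    /** Hand-computed scores for the sample genres (an assumption of this sketch). */
    static List<Pair> samplePairs() {
        List<Pair> pairs = new ArrayList<>();
        pairs.add(new Pair("Fantasy", "Science Fiction", 1.0));
        pairs.add(new Pair("Dystopia", "Science Fiction", 1.0));
        pairs.add(new Pair("Dystopia", "Classics", 1.0));
        pairs.add(new Pair("Classics", "Science Fiction", 0.75));
        pairs.add(new Pair("Fantasy", "Classics", 2.0 / 3.0));
        pairs.add(new Pair("Dystopia", "Fantasy", 0.5));
        return pairs;
    }

    public static void main(String[] args) {
        for (Pair p : topK(samplePairs(), 2, 0.5)) {
            System.out.println(p.from + " -> " + p.to + " : " + p.score);
        }
    }
}
```

In this sketch, Dystopia has three candidate pairs at or above 0.5, so its weakest one (Dystopia -> Fantasy) is the only pair removed by K = 2.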
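6. Specify source and target IDs: sometimes we do not want to compute the similarity of every pair, and instead want to compare only a chosen subset of items with each other. The sourceIds and targetIds keys in the config do this.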
MATCH (book:Book)-[:HAS_GENRE]->(genre)
WITH {item:id(genre), name: genre.name, categories: collect(id(book))} as userData
WITH collect(userData) as data
WITH data,[value in data WHERE value.name IN ["Fantasy", "Classics"] | value.item ] AS sourceIds
CALL algo.similarity.overlap.stream(data, {sourceIds: sourceIds})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
4. Source code analysis
The algo.similarity.overlap function:
public double overlapSimilarity(@Name("vector1") List<Number> vector1, @Name("vector2") List<Number> vector2) {
    // A null input yields a similarity of 0
    if (vector1 == null || vector2 == null) return 0;

    // Distinct values that appear in both lists
    HashSet<Number> intersectionSet = new HashSet<>(vector1);
    intersectionSet.retainAll(vector2);
    int intersection = intersectionSet.size();

    // Normalize by the length of the shorter list
    long denominator = Math.min(vector1.size(), vector2.size());
    return denominator == 0 ? 0 : (double) intersection / denominator;
}
The algo.similarity.overlap procedure:
public Stream<SimilaritySummaryResult> overlap(
        @Name(value = "data", defaultValue = "null") List<Map<String, Object>> data,
        @Name(value = "config", defaultValue = "{}") Map<String, Object> config) {
    ProcedureConfiguration configuration = ProcedureConfiguration.create(config);
    CategoricalInput[] inputs = prepareCategories(data, getDegreeCutoff(configuration));

    String writeRelationshipType = configuration.get("writeRelationshipType", "NARROWER_THAN");
    String writeProperty = configuration.getWriteProperty("score");
    if (inputs.length == 0) {
        return emptyStream(writeRelationshipType, writeProperty);
    }

    long[] inputIds = SimilarityInput.extractInputIds(inputs);
    int[] sourceIndexIds = indexesFor(inputIds, configuration, "sourceIds");
    int[] targetIndexIds = indexesFor(inputIds, configuration, "targetIds");

    SimilarityComputer<CategoricalInput> computer = similarityComputer(sourceIndexIds, targetIndexIds);
    SimilarityRecorder<CategoricalInput> recorder = categoricalSimilarityRecorder(computer, configuration);

    double similarityCutoff = getSimilarityCutoff(configuration);
    Stream<SimilarityResult> stream = topN(
            similarityStream(inputs, sourceIndexIds, targetIndexIds, recorder, configuration,
                    () -> null, similarityCutoff, getTopK(configuration)),
            getTopN(configuration));

    boolean write = configuration.isWriteFlag(false) && similarityCutoff > 0.0;
    return writeAndAggregateResults(stream, inputs.length, sourceIndexIds.length, targetIndexIds.length,
            configuration, write, writeRelationshipType, writeProperty, recorder);
}
The algo.similarity.overlap.stream procedure:
public Stream<SimilarityResult> similarityStream(
        @Name(value = "data", defaultValue = "null") List<Map<String,Object>> data,
        @Name(value = "config", defaultValue = "{}") Map<String, Object> config) {
    ProcedureConfiguration configuration = ProcedureConfiguration.create(config);
    CategoricalInput[] inputs = prepareCategories(data, getDegreeCutoff(configuration));
    if (inputs.length == 0) {
        return Stream.empty();
    }

    long[] inputIds = SimilarityInput.extractInputIds(inputs);
    int[] sourceIndexIds = indexesFor(inputIds, configuration, "sourceIds");
    int[] targetIndexIds = indexesFor(inputIds, configuration, "targetIds");

    SimilarityComputer<CategoricalInput> computer = similarityComputer(sourceIndexIds, targetIndexIds);
    return topN(
            similarityStream(inputs, sourceIndexIds, targetIndexIds, computer, configuration,
                    () -> null, getSimilarityCutoff(configuration), getTopK(configuration)),
            getTopN(configuration));
}
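Copyright notice: this article, titled NEO4J, was contributed voluntarily by Deputy Director 林淑君, and the views expressed are the author's own. To repost it, contact the author and cite the source: http://www.xiehuijuan.com/baike/1697212159a339588.html. This site only provides information storage space; it does not own the content and assumes no related legal liability. If any content on this site is suspected of plagiarism, infringement, or violating laws or regulations, it will be removed immediately once verified.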