Using Node.js & Elasticsearch to search GitHub: 1 Setup
In this tutorial I will show how to build a JavaScript application on top of Elasticsearch. Its core will be written in Node.js, with Vue.js on the frontend. We will use modern JavaScript (ES6+) features such as `async/await`, the spread operator, and destructuring assignment.
The tutorial will be split into several articles:
- Using Node.js & Elasticsearch to search GitHub: 1 Setup
- Using Node.js & Elasticsearch to search GitHub: 2 Vue.js (TBD)
- Using Node.js & Elasticsearch to search GitHub: 3 GraphQL (TBD)
- TBA
The source code can be found on GitHub.
Goal
The goal is to build a search engine for GitHub repositories. We will connect to GitHub's API to fetch a list of trending repositories. Each repository's data, along with its README, will be indexed by Elasticsearch.
We will be constructing the application iteratively. Its initial versions won't be ideal. The intention is to showcase a possible process of software development: start with something small that works and improve it along the way by applying various refactorings.
Requirements
Before we start, make sure you have the following software installed:
- Node.js
- Yarn
- Elasticsearch
Check that Elasticsearch is running (the `http` command below comes from HTTPie):
http :9200
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 288
content-type: application/json; charset=UTF-8
{
    "cluster_name": "tempertynka",
    "cluster_uuid": "SWjeIaE4SrOQV-SJRxwObA",
    "name": "ZGl7bwG",
    "tagline": "You Know, for Search",
    "version": {
        "build_date": "2018-04-12T20:37:28.497551Z",
        "build_hash": "ccec39f",
        "build_snapshot": false,
        "lucene_version": "7.2.1",
        "minimum_index_compatibility_version": "5.0.0",
        "minimum_wire_compatibility_version": "5.6.0",
        "number": "6.2.4"
    }
}
Create Project
Let's start by creating a project using Yarn.
yarn init
Next, install the `elasticsearch` package from the NPM registry.
yarn add elasticsearch
As a first step, let's connect to the Elasticsearch instance from JavaScript and check its health. Put the following code in `index.js`:
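const { Client } = require('elasticsearch');

// Connect to the local Elasticsearch instance
const client = new Client({
  host: 'localhost:9200'
});

// Ask the cluster for its health and print the response
const main = async () => {
  const health = await client.cluster.health();
  console.log(health);
};

main();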
Run it:
node index.js
{ cluster_name: 'tempertynka',
  status: 'yellow',
  timed_out: false,
  number_of_nodes: 1,
  number_of_data_nodes: 1,
  active_primary_shards: 10,
  active_shards: 10,
  relocating_shards: 0,
  initializing_shards: 0,
  unassigned_shards: 10,
  delayed_unassigned_shards: 0,
  number_of_pending_tasks: 0,
  number_of_in_flight_fetch: 0,
  task_max_waiting_in_queue_millis: 0,
  active_shards_percent_as_number: 50 }
Create Elasticsearch Index
Now we can create an Elasticsearch index. Let's call it `github`, with `trending` as the type.
const init = async () => {
  await client.indices.create({
    index: 'github',
    body: {
      mappings: {
        trending: {
          properties: {
            name: { type: 'text' },
            url: { type: 'text' },
            description: { type: 'text', analyzer: 'english' },
            readme: { type: 'text', analyzer: 'english' },
          }
        }
      }
    }
  });
};
Now we can start indexing documents. Let's create a wrapper around Elasticsearch's `index()` method to control and limit the possible input data.
const index = async ({ name, description, readme }) => {
  await client.index({
    index: 'github',
    type: 'trending',
    body: { name, description, readme }
  });
};
In the example above, we use destructuring assignment to extract the values of specific properties from the input object and store them as variables. Any other property of the input object is ignored.
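To make the destructuring step concrete, here is a small standalone sketch. The `pick` function and the sample object are hypothetical and exist only for illustration. They mirror how the wrapper keeps three properties and drops the rest.

```javascript
// Hypothetical helper: the parameter list destructures the input object,
// so only the three named properties are available inside the function.
const pick = ({ name, description, readme }) => ({ name, description, readme });

// Made-up sample input with two extra properties (id and url).
const input = {
  id: 1,
  name: 'octocat/hello-world',
  url: 'https://github.com/octocat/hello-world',
  description: 'My first repository',
  readme: '# Hello World',
};

const picked = pick(input);

// Lists the properties that survived: name, description, readme
console.log(Object.keys(picked));
```

This is why it is safe to pass a larger object to the wrapper: properties that are not named in the parameter list never reach Elasticsearch.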
Connect with GitHub API
The next step is to connect to GitHub's API. We will start with the REST API and, in the following articles, refactor the code to use the GraphQL API.
First, install `axios`, a promise-based HTTP library.
yarn add axios
We will fetch the most starred (trending) JavaScript repositories from the last week.
const axios = require('axios');

// Fetch the most starred JavaScript repositories created after the given date
const fetchTrendingRepositories = async () => {
  const { data: { items } } = await axios({
    baseURL: 'https://api.github.com/',
    url: '/search/repositories',
    params: {
      sort: 'stars',
      order: 'desc',
      q: 'language:javascript created:>2018-04-15',
    }
  });

  // Keep only the fields we need, renaming some for convenience
  return items.map(({
    id, full_name, html_url, description
  }) => ({
    id,
    name: full_name,
    url: html_url,
    description
  }));
};
We send a request to the `/search/repositories` endpoint. The query parameters restrict the results to JavaScript repositories created within the last week (after 2018-04-15 in this example), sorted by stars in descending order.
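The date in the query is hard-coded, so it goes stale quickly. As a possible refinement, it could be computed at run time. The sketch below is one way to do it. `daysAgo` and `buildQuery` are hypothetical helper names, not part of the original code, and the dates are handled in UTC.

```javascript
// Hypothetical helper: formats the date `days` days before `now` as YYYY-MM-DD (UTC).
const daysAgo = (days, now = new Date()) => {
  const millisecondsPerDay = 24 * 60 * 60 * 1000;
  const date = new Date(now.getTime() - days * millisecondsPerDay);
  return date.toISOString().slice(0, 10);
};

// Hypothetical helper: builds the value for the `q` search parameter.
const buildQuery = (now = new Date()) =>
  `language:javascript created:>${daysAgo(7, now)}`;

// With a fixed "current" date of 2018-04-22, the result matches the hard-coded query.
console.log(buildQuery(new Date('2018-04-22T12:00:00Z')));
```

The result of `buildQuery()` could then be passed as the `q` parameter in place of the literal string.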
`axios` returns the response payload under the `data` field, while the GitHub API places the requested repository list under the `items` field. We use destructuring assignment again to place that collection directly into the `items` variable.
Each item returned by the GitHub API contains many other fields, so we use `.map()` to keep only the ones we are interested in. In the process we rename some of those fields for convenience.
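The nested destructuring and the `.map()` step can be tried without any network access. The sketch below uses a made-up, heavily trimmed object that only resembles the shape of an `axios` response wrapping a GitHub search result. The sample values are assumptions for illustration.

```javascript
// Made-up stand-in for an axios response: payload under `data`, repositories under `items`.
const response = {
  status: 200,
  data: {
    total_count: 1,
    items: [
      {
        id: 42,
        full_name: 'octocat/hello-world',
        html_url: 'https://github.com/octocat/hello-world',
        description: 'My first repository',
        stargazers_count: 1234,
        forks_count: 56,
      },
    ],
  },
};

// Nested destructuring: reach into response.data.items in one step.
const { data: { items } } = response;

// Keep four fields and rename two of them.
const repos = items.map(({ id, full_name, html_url, description }) => ({
  id,
  name: full_name,
  url: html_url,
  description,
}));

console.log(repos);
```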
The next step is to fetch the README for each of those repositories, using the `/repos/<repository name>/readme` endpoint. Let's write an auxiliary function that fetches the README of the repository specified by the `name` input parameter.
const fetchReadme = async name => {
  const { data: readme } = await axios({
    baseURL: 'https://api.github.com/',
    url: `/repos/${name}/readme`,
    headers: {
      // Ask GitHub for the raw README contents instead of the default JSON representation
      accept: 'application/vnd.github.v3.raw'
    }
  });
  return readme;
};
Save to Elasticsearch
Now we can merge the repository data with its README and store the result in Elasticsearch.
const store = async () => {
  try {
    const repos = await fetchTrendingRepositories();
    // Process repositories one by one: fetch the README, then index the merged document
    for (const repo of repos) {
      const readme = await fetchReadme(repo.name);
      await index({ ...repo, readme });
    }
  } catch (error) {
    console.log(error.message);
  }
};
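The merge itself is done by the spread operator in `{ ...repo, readme }`. Here is a minimal standalone sketch of that expression, using made-up sample values instead of real API data:

```javascript
// Made-up repository data and README text, for illustration only.
const repo = {
  id: 1,
  name: 'octocat/hello-world',
  description: 'My first repository',
};
const readme = '# Hello World';

// Copy every property of `repo` into a new object and add `readme` to it.
const merged = { ...repo, readme };

console.log(merged);

// The original object is not modified by the spread.
console.log('readme' in repo);
```

Because the spread creates a new object, the items returned by `fetchTrendingRepositories()` stay unchanged.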
Query Elasticsearch
The final piece of the puzzle is the `search()` function.
const search = async query => {
  const results = await client.search({
    index: 'github',
    size: 10,
    body: {
      query: {
        multi_match: {
          query,
          type: 'cross_fields',
          fields: ['name', 'description^2', 'readme^3'],
          operator: 'or',
          tie_breaker: 1.0,
          cutoff_frequency: 0.1
        }
      }
    }
  });

  return results.hits.hits.map(({
    _source: { name, description, readme }
  }) => ({
    name, description, readme,
  }));
};
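We specify the query as a `multi_match`. Each field has a different weight, specified with `fields` as `['name', 'description^2', 'readme^3']`: `description` is boosted by a factor of 2 and `readme` by a factor of 3. With `cross_fields`, the query terms are looked up across all listed fields as if they were one big field. Because `operator` is set to `'or'`, a document matches when at least one term is present in at least one field; with `'and'`, all terms of the query would have to be present in at least one field.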
// await requires an enclosing async function, such as main()
const results = await search('webassembly');
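To experiment with the query without touching `search()`, the request body can be produced by a small pure function. The sketch below assumes a hypothetical `buildSearchBody` helper, which is not part of the original code. It leaves out `tie_breaker` and `cutoff_frequency` for brevity and exposes `operator` as a parameter, since in a `multi_match` query `'and'` requires every term to match while `'or'` requires only one.

```javascript
// Hypothetical helper: builds the `body` that would be passed to client.search().
const buildSearchBody = (query, operator = 'or') => ({
  query: {
    multi_match: {
      query,
      type: 'cross_fields',
      // description is boosted by 2 and readme by 3, relative to name
      fields: ['name', 'description^2', 'readme^3'],
      operator,
    },
  },
});

// Print the body for a two-term query that requires both terms.
console.log(JSON.stringify(buildSearchBody('webassembly compiler', 'and'), null, 2));
```

Keeping the body in a separate function also makes it possible to inspect the query before sending it to Elasticsearch.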
Source code
Check the source code on GitHub if you are not sure how to combine all the pieces.
Next
In the next article we will build a simple web UI using Vue.js. Stay tuned.