Amazon Neptune First Impressions

Hello, Habr readers. Ahead of the start of the AWS for Developers course, we have prepared a translation of an interesting article.




In many of the use cases that we at bakdata see at our customers' sites, the relevant information is hidden in relationships between entities, for example when analyzing relationships between users, dependencies between elements, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post, we want to share our first impressions, good practices, and what could be improved over time.

Why we need Amazon Neptune


Graph databases promise to handle highly connected datasets better than their relational counterparts. In such datasets, the relevant information is usually stored in the relationships between objects. To test Neptune, we used MusicBrainz, an excellent open-data project. MusicBrainz collects every conceivable piece of metadata about music, such as information about artists, songs, album releases, or concerts, as well as who an artist collaborated with on a song, or when and in which country an album was released. MusicBrainz can be seen as a huge network of entities that are in some way related to the music industry.

The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows in 157 tables. While some of these tables contain master data, such as artists, events, recordings, releases, or tracks, others (link tables) store relationships between artists and recordings, other artists, or releases, and so on. These link tables expose the graph structure of the dataset. After converting the dataset into RDF triples, we ended up with about 500 million triples.

Based on our experience with the project partners we work with, we consider a setting in which this knowledge base is used to derive new information. In addition, we assume that it is updated regularly, for example by adding new releases or updating band members.

Setup


As expected, setting up Amazon Neptune is easy, and the process is documented in some detail. You can start the graph database in just a few clicks. However, when it comes to more detailed configuration, the information you need is hard to find. We therefore want to point out one configuration parameter.


Screenshot: parameter group configuration.

Amazon claims that Neptune focuses on low-latency transactional workloads, which is why the default query timeout is 120 seconds. However, we tested many analytical use cases in which we regularly hit this limit. You can change the timeout by creating a new parameter group for Neptune and setting neptune_query_timeout to a suitable limit.
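As a rough illustration, the new value could be prepared like this. The sketch assumes that neptune_query_timeout is expressed in milliseconds (so the 120-second default corresponds to 120000) and that the change is submitted through an AWS SDK call that accepts a list of name/value entries. The helper name and the ApplyMethod value are our own choices for the example, not something taken from the setup described above.

```python
def query_timeout_parameter(seconds: int) -> dict:
    """Build a parameter-group entry for neptune_query_timeout.

    Assumption: the parameter value is expressed in milliseconds.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(seconds * 1000),
        # Assumption: apply the change on the next reboot; adjust as needed.
        "ApplyMethod": "pending-reboot",
    }


# Example: raise the limit from the default 2 minutes to 30 minutes.
param = query_timeout_parameter(30 * 60)
print(param["ParameterName"], param["ParameterValue"])
```

The resulting dictionary is only the payload; creating the parameter group and attaching it to the cluster still has to be done in the console or through the SDK.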

Data loading


Below we discuss in detail how we loaded the MusicBrainz data into Neptune.

Relations to triples


First, we converted the MusicBrainz data to RDF triples. To do so, for each table we defined a template that determines how each column is represented as a triple. In the example below, each row of the artist table is mapped to twelve RDF triples.

<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gid> "${gid}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> "${name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/sort-name> "${sort_name}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/begin-date> "${begin_date_year}-${begin_date_month}-${begin_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/end-date> "${end_date_year}-${end_date_month}-${end_date_day}"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/type> <http://musicbrainz.foo/artist-type/${type}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> <http://musicbrainz.foo/area/${area}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/gender> <http://musicbrainz.foo/gender/${gender}> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/comment> "${comment}"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/edits-pending> "${edits_pending}"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/last-updated> "${last_updated}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/ended> "${ended}"^^<http://www.w3.org/2001/XMLSchema#boolean> .
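To make the template idea concrete, here is a minimal sketch of how such templates could be rendered for one row. It is not the tooling we actually used: it relies on Python's string.Template, which happens to share the ${column} placeholder syntax, shows only two of the twelve templates, and uses a made-up example row. The escape_literal helper is an addition for the sketch.

```python
from string import Template

# Two of the twelve artist templates from above; ${...} are column names.
ARTIST_TEMPLATES = [
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/name> '
             '"${name}"^^<http://www.w3.org/2001/XMLSchema#string> .'),
    Template('<http://musicbrainz.foo/artist/${id}> <http://musicbrainz.foo/area> '
             '<http://musicbrainz.foo/area/${area}> .'),
]


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_triples(row: dict) -> list:
    """Render one row (column name -> value) into N-Triples lines."""
    safe = {key: escape_literal(str(value)) for key, value in row.items()}
    return [template.substitute(safe) for template in ARTIST_TEMPLATES]


# Hypothetical row, for illustration only.
for line in row_to_triples({"id": 1, "name": 'The "Example" Band', "area": 221}):
    print(line)
```

Escaping quotes, backslashes and line breaks matters here because every value ends up inside a quoted literal on a single line.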


Bulk loading


The recommended way to load large amounts of data into Neptune is bulk loading through S3. After uploading your triple files to S3, you start the load with a POST request. In our case, it took about 24 hours for 500 million triples. We expected it to be faster.

curl -X POST -H 'Content-Type: application/json' http://your-neptune-cluster:8182/loader -d '{
 "source" : "s3://your-s3-bucket",
 "format" : "ntriples",
 "iamRoleArn" : "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
 "region" : "eu-west-1",
 "failOnError" : "FALSE"
}'
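The same request can also be assembled from a script, which is convenient when loads are triggered automatically. The following sketch uses only the Python standard library and mirrors the curl call above; the function name is our own, and the endpoint, bucket and role values are the same placeholders. It only builds the request and does not send it.

```python
import json
import urllib.request


def build_loader_request(endpoint: str, source: str, role_arn: str,
                         region: str) -> urllib.request.Request:
    """Build (but do not send) the bulk-load POST request for the loader endpoint."""
    payload = {
        "source": source,
        "format": "ntriples",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"{endpoint}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_loader_request(
    "http://your-neptune-cluster:8182",
    "s3://your-s3-bucket",
    "arn:aws:iam::your-iam-user:role/NeptuneLoadFromS3",
    "eu-west-1",
)
print(request.get_method(), request.full_url)
# To actually start the load: urllib.request.urlopen(request)
```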

To avoid this lengthy process every time we launch Neptune, we decided to restore the instance from a snapshot in which these triples are already loaded. Starting from a snapshot is much faster, but it still takes about an hour until Neptune is available for queries.

When initially loading triples in Neptune, we encountered various errors.

{
 "errorCode" : "PARSING_ERROR",
 "errorMessage" : "Content after '.' is not allowed",
 "fileName" : [...],
 "recordNum" : 25
}

Some of them were parsing errors, as shown above. To this day, we have not figured out what exactly went wrong here. A little more detail in the error message would definitely help. This error occurred for about 1% of the inserted triples. Since we were only testing Neptune, we accepted working with just 99% of the information from MusicBrainz.

Although this is straightforward for people familiar with SPARQL, keep in mind that RDF triples must be annotated with explicit data types, which can again cause errors.
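Since the loader's messages are terse, a local pre-check of the files before uploading them can help narrow such problems down. The sketch below is a deliberately simplified check, not a full N-Triples parser, and we cannot say whether it would have caught the errors we saw. It only verifies that each line has the shape "subject IRI, predicate IRI, IRI or typed literal, terminating dot" with nothing after the dot. Blank nodes, language tags and comments are not handled, and the sample lines are made up.

```python
import re

# Simplified line shape: <subject> <predicate> (<object> | "literal"^^<type>) .
NTRIPLE_LINE = re.compile(
    r'^<[^<>\s]+>\s+<[^<>\s]+>\s+'
    r'(?:<[^<>\s]+>|"(?:[^"\\\n]|\\.)*"(?:\^\^<[^<>\s]+>)?)'
    r'\s*\.\s*$'
)


def find_suspicious_lines(lines):
    """Return (line_number, text) pairs that do not match the simple shape."""
    return [(number, text) for number, text in enumerate(lines, start=1)
            if text.strip() and not NTRIPLE_LINE.match(text)]


sample = [
    '<http://musicbrainz.foo/artist/1> <http://musicbrainz.foo/name> "A"^^<http://www.w3.org/2001/XMLSchema#string> .',
    '<http://musicbrainz.foo/artist/2> <http://musicbrainz.foo/name> "B" . extra',
    '<http://musicbrainz.foo/artist/3> <http://musicbrainz.foo/name> "C "quoted""^^<http://www.w3.org/2001/XMLSchema#string> .',
]
for number, text in find_suspicious_lines(sample):
    print(number, text)
```

In the sample, the second line has trailing content after the dot and the third contains unescaped quotes, so both are reported.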

Streaming load


As mentioned above, we do not want to use Neptune as a static data warehouse, but rather as a flexible and evolving knowledge base. We therefore needed to find ways to insert new triples when the knowledge base changes, for example when a new album is published or when we want to materialize derived knowledge.

Neptune supports insert operations via SPARQL queries, both with raw data and based on patterns. We discuss both approaches below.

One of our goals was to ingest data in a streaming fashion. Consider the release of an album in a new country. From the point of view of MusicBrainz, this means that for a release (a term that covers albums, singles, EPs, etc.), a new row is added to the release-country table. In RDF, we map this information to two new triples.

INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/release> <http://musicbrainz.foo/release/435759> };
INSERT DATA { <http://musicbrainz.foo/release-country/737041> <http://musicbrainz.foo/date-year> "2018"^^<http://www.w3.org/2001/XMLSchema#int> };
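In a streaming setup, such updates would be generated from incoming rows rather than written by hand. The following sketch shows one way to do that; the function name is hypothetical, and unlike the statement above it puts both triples into a single INSERT DATA block. It also prepares a form-encoded body with an `update` parameter, which is how the SPARQL 1.1 protocol expects updates to be sent via POST.

```python
from urllib.parse import urlencode

BASE = "http://musicbrainz.foo"
XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


def release_country_update(release_country_id: int, release_id: int, year: int) -> str:
    """Build an INSERT DATA update for one new release-country row."""
    subject = f"<{BASE}/release-country/{release_country_id}>"
    triples = [
        f"{subject} <{BASE}/release> <{BASE}/release/{release_id}>",
        f'{subject} <{BASE}/date-year> "{year}"^^<{XSD_INT}>',
    ]
    return "INSERT DATA { " + " . ".join(triples) + " }"


update = release_country_update(737041, 435759, 2018)
# Form-encoded body for a POST to the SPARQL endpoint.
body = urlencode({"update": update})
print(update)
```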

Another goal was to derive new knowledge from the graph. Suppose we want to get the number of releases that each artist has published in their career. Such a query is rather complex and takes more than 20 minutes in Neptune, so we need to materialize the result in order to reuse this new knowledge in other queries. We therefore add triples with this information back to the graph by inserting the result of a subquery.

INSERT {
  ?artist_credit <http://musicbrainz.foo/number-of-releases> ?number_of_releases
} WHERE {
  SELECT ?artist_credit (COUNT(*) as ?number_of_releases)
  WHERE {
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .
     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
  }
  GROUP BY ?artist_credit
}

Adding single triples to the graph takes a few milliseconds, while the time needed to insert the result of a subquery depends on the execution time of the subquery itself.

Although we did not use it often, Neptune also lets you delete triples based on patterns or explicit data, which can be used to update information.

SPARQL queries


With the previous subquery, which returns the number of releases for each artist, we have already introduced the first type of query that we want to answer using Neptune. Issuing a query in Neptune is easy: send a POST request to the SPARQL endpoint, as shown below:

curl -X POST --data-binary 'query=SELECT ?artist ?p ?o where {?artist <http://musicbrainz.foo/name> "Elton John" . ?artist ?p ?o . }' http://your-neptune-cluster:8182/sparql
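When calling the endpoint from code, the query should be URL-encoded and the response parsed. The sketch below shows both steps with the Python standard library. It assumes the endpoint answers in the W3C SPARQL JSON results format; the sample response is hand-written for illustration (including the artist IRI) and is not real Neptune output.

```python
import json
from urllib.parse import urlencode


def build_query_body(query: str) -> bytes:
    """URL-encode a SPARQL query as the form body of a POST request."""
    return urlencode({"query": query}).encode("ascii")


def rows_from_results(payload: str) -> list:
    """Flatten a SPARQL JSON results document into plain dicts."""
    bindings = json.loads(payload)["results"]["bindings"]
    return [{name: cell["value"] for name, cell in binding.items()}
            for binding in bindings]


# Hand-written sample response, for illustration only.
sample_response = json.dumps({
    "head": {"vars": ["artist", "p", "o"]},
    "results": {"bindings": [{
        "artist": {"type": "uri", "value": "http://musicbrainz.foo/artist/0"},
        "p": {"type": "uri", "value": "http://musicbrainz.foo/name"},
        "o": {"type": "literal", "value": "Elton John"},
    }]},
})

for row in rows_from_results(sample_response):
    print(row["artist"], row["p"], row["o"])
```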

In addition, we implemented a query that returns an artist profile containing their name, age, or country of origin. Keep in mind that artists can be persons, bands, or orchestras. We also enrich this data with the number of releases an artist published per year. For solo artists, we additionally add the bands in which the artist participated in each year.

SELECT
 ?artist_name ?year
 ?releases_in_year ?releases_up_year
 ?artist_type_name ?releases
 ?artist_gender ?artist_country_name
 ?artist_begin_date ?bands
 ?bands_in_year
WHERE {
 # Bands for each artist
 {
   SELECT
     ?year
     ?first_artist
     (group_concat(DISTINCT ?second_artist_name;separator=",") as ?bands)
     (COUNT(DISTINCT ?second_artist_name) AS ?bands_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }
     ?first_artist <http://musicbrainz.foo/name> "Elton John" .
     ?first_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?first_artist <http://musicbrainz.foo/type> ?first_artist_type .
     ?first_artist <http://musicbrainz.foo/name> ?first_artist_name .

     ?second_artist <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist> .
     ?second_artist <http://musicbrainz.foo/type> ?second_artist_type .
     ?second_artist <http://musicbrainz.foo/name> ?second_artist_name .
     optional { ?second_artist <http://musicbrainz.foo/begin-date-year> ?second_artist_begin_date_year . }
     optional { ?second_artist <http://musicbrainz.foo/end-date-year> ?second_artist_end_date_year . }

     ?l_artist_artist <http://musicbrainz.foo/entity0> ?first_artist .
     ?l_artist_artist <http://musicbrainz.foo/entity1> ?second_artist .
     ?l_artist_artist <http://musicbrainz.foo/link> ?link .

     optional { ?link <http://musicbrainz.foo/begin-date-year> ?link_begin_date_year . }
     optional { ?link <http://musicbrainz.foo/end-date-year> ?link_end_date_year . }

     FILTER (!bound(?link_begin_date_year) || ?link_begin_date_year <= ?year)
     FILTER (!bound(?link_end_date_year) || ?link_end_date_year >= ?year)
     FILTER (!bound(?second_artist_begin_date_year) || ?second_artist_begin_date_year <= ?year)
     FILTER (!bound(?second_artist_end_date_year) || ?second_artist_end_date_year >= ?year)
     FILTER (?first_artist_type NOT IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
     FILTER (?second_artist_type IN (<http://musicbrainz.foo/artist-type/2>, <http://musicbrainz.foo/artist-type/5>, <http://musicbrainz.foo/artist-type/6>))
   }
   GROUP BY ?first_artist ?year
 }
 # Releases up to a year
 {
   SELECT
     ?artist
     ?year
     (group_concat(DISTINCT ?release_name;separator=",") as ?releases)
     (COUNT(*) as ?releases_up_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release <http://musicbrainz.foo/name> ?release_name .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year <= ?year)
   }
   GROUP BY ?artist ?year
 }
 # Releases in a year
 {
   SELECT ?artist ?year (COUNT(*) as ?releases_in_year)
   WHERE {
     VALUES ?year {
       1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
       1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
       1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
       2010 2011 2012 2013 2014 2015 2016 2017 2018
     }

     ?artist <http://musicbrainz.foo/name> "Elton John" .

     ?artist_credit_name <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?artist_credit_name <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit-name> .
     ?artist_credit_name <http://musicbrainz.foo/artist> ?artist .
     ?artist_credit <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/artist-credit> .

     ?release_group <http://musicbrainz.foo/artist-credit> ?artist_credit .
     ?release_group <http://musicbrainz.foo/rdftype> <http://musicbrainz.foo/release-group> .
     ?release_group <http://musicbrainz.foo/name> ?release_group_name .
     ?release <http://musicbrainz.foo/release-group> ?release_group .
     ?release_country <http://musicbrainz.foo/release> ?release .
     ?release_country <http://musicbrainz.foo/date-year> ?release_country_year .

     FILTER (?release_country_year = ?year)
   }
   GROUP BY ?artist ?year
 }
 # Master data
 {
   SELECT DISTINCT ?artist ?artist_name ?artist_gender ?artist_begin_date ?artist_country_name
   WHERE {
     ?artist <http://musicbrainz.foo/name> ?artist_name .
     ?artist <http://musicbrainz.foo/name> "Elton John" .
     ?artist <http://musicbrainz.foo/gender> ?artist_gender_id .
     ?artist_gender_id <http://musicbrainz.foo/name> ?artist_gender .
     ?artist <http://musicbrainz.foo/area> ?birth_area .
     ?artist <http://musicbrainz.foo/begin-date-year> ?artist_begin_date.
     ?birth_area <http://musicbrainz.foo/name> ?artist_country_name .

     FILTER(datatype(?artist_begin_date) = <http://www.w3.org/2001/XMLSchema#int>)
   }
 }
}

Due to the complexity of such a query, we could only run point queries for a specific artist, such as Elton John, but not for all artists. Neptune does not seem to optimize such a query by pushing filters down into the subqueries. Therefore, each subquery has to be filtered manually by the artist's name.

Neptune charges both by the hour and per I/O operation. For our tests, we used the smallest Neptune instance, which costs $0.384/hour. For the query above, which computes the profile of a single artist, Amazon charges us for tens of thousands of I/O operations, which amounts to a cost of about $0.02.
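The arithmetic behind such an estimate is simple once the number of I/O operations is known. The sketch below uses the hourly rate quoted above. The price per million I/O operations is an assumption that does not come from this article, so check the current price list for your region before relying on it.

```python
HOURLY_RATE_USD = 0.384          # smallest instance, as quoted above
IO_PRICE_PER_MILLION_USD = 0.20  # ASSUMPTION: not stated in this article


def estimate_cost(hours: float, io_operations: int) -> float:
    """Estimate the cost of running an instance and issuing I/O operations."""
    instance_cost = hours * HOURLY_RATE_USD
    io_cost = io_operations / 1_000_000 * IO_PRICE_PER_MILLION_USD
    return round(instance_cost + io_cost, 4)


# I/O share of a single profile query, assuming roughly 100,000 operations.
print(estimate_cost(hours=0, io_operations=100_000))
```

The hard part in practice is not the formula but knowing the number of I/O operations a query will cause before running it.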

Conclusion


First off, Amazon Neptune keeps most of its promises. As a managed service, it is a graph database that is extremely easy to set up and can be launched without much configuration. Here are our five key findings:

  • Bulk loading is simple but slow, and it can get complicated because the error messages are not very helpful
  • Streaming load supported everything we expected and was fast enough
  • Querying is simple, but not interactive enough for analytical queries
  • SPARQL queries must be optimized manually
  • Costs are hard to estimate because it is difficult to predict the amount of data scanned by a SPARQL query

