TimeSeries Analysis with Apache Spark and Oracle NoSQL database

Hello What is a Timeseries?

Timeseries  is a sequence of Events, ordered by timestamp of the events. How to store and analyze  Timeseries data has always been interesting for financial and other business domains, but is attracting  more attention recently with IoT applications.

This article describes modelling and analysis of Timeseries data with Apache Spark and Oracle NoSQL database. 

A Timeseries is formally described as a sequence of events {e1, e2, … eN}
such that t(i) ≤ t(j) ∀ i < j where t(i)is the timestamp for i-th event ei

Timeseries model in NoSQL 

Timeseries data characteristics are different from RDBMS record

TimeSeries data is different in characteristics than conventional RDBMS record. The information of a timeseries  is in aggregate properties of its events. 

Timeseries data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. 


why NoSQL?


A timeseries is modeled as a table in NoSQL database. The name of the timeseries is same as the table and hence usual restrictions on table name applies. A row in timeseries table represents a Time Slot.

A timeslot is block of events with some metadata attribute such as slot index, start and end time of the slot etc. A time slot is atomic unit of a timeseries  and an entire timeseries is a row of time slots.

A timeslot, as we would see later, provides the basic partition of timeseries and forms the basis for integration with Apache Spark.


Apache Spark

Apache Spark is a powerful cluster computing framework. Apache Spark performs a task on a large dataset by computing over partitions of the dataset where each computation may execute on different hosts of a cluster. Spark framework controls how partial tasks are distributed across the cluster hosts  and how to combine partial results.
Moreover, Spark models a computational task as a graph of actions where actions are lazily evaluated only when a result producing action needs to be computed.

Unique, monotonic sequence in NoSQL using SELECT FOR UPDATE style lock

Persistent Sequence in NoSQL database

What is a Database Sequence?

A database sequence is a monotonic series of unique numbers. A sequence is widely used in database for primary keys and other uniquely identifiable values. 

Database Sequence in NoSQL 

Because NoSQL database does not use locks, and data is maintained in multiple copies distributed across different storage medium, generating a series of numbers that are unique and monotonic across multiple threads (and applications) becomes a non-trivial problem.
To generate a sequence, we need a mechanism that can acquire a lock on a record, update its value and ensures that no other thread/process can update the same row in the meantime. This is typical  SELECT FOR UPDATE  behavior found in in RDBMS. It is also know as Read-Modify-Write pattern in database programming.
In this article, we show how advanced features of Oracle NoSQL database can achieve similar SELECT FOR UPDATE behavior.
We take a concrete example of persistence Sequence to   demonstrate this behavior. The persistennt sequence would generate a  series of values that are unique and monotonic>

From usage point of view, an user application builds a sequence with a SequenceBuilder  .  A  SequenceBuilder employs typical  builder pattern to create a Sequence:
 
     Sequence seq = new SequenceBuider()
                  .withStore(store)
                  .withName("mysequence")
                  .withIncrement(100)
                  .build();

A sequence is created with a database connection and an unique name. Additional configuration such increment i.e. as how many values are cached in a sequence and hence can be served without a network call to database can be specified.

Once a sequence is obtained, an application can get the series of values by:

   long val = seq.next()

For simplicity, without loss of generality, we assume that seq.next() method returns a long value.

In the code example above, the sequence is created to fetch 100 contiguous values from database at a time, and these values are cached in memory. But if seq.next() is called more than 100 times, next batch of values will be fetched from database transparently to user.

A sequence is monotonic across multiple threads, multiple application 

An application may create sequence from different threads, even different applications  may grab a sequence, but a sequence of same name created from same database always generate monotonic, unique sequence of numbers.

NoSQL advanced features to support Database Sequence

The key to the solution lies in special features of NoSQL database, namely 
  • Row Versioning: every database record carries a version 
  • Conditional put operation: a put() operation can be executed conditionally on record version.
  • Access to Previous State: previous state of a record is available

Row Versioning

NoSQL database uses Row as the basic abstraction for record. Every row carries a version.
Row.getVersion() returns a version which is an opaque array of bytes.  The version is changed on every modification of a row.

Conditional put() Operation

NoSQL provides a conditional put semantics where a row can only be put to the database if the row version in the database  match the version being put. The API method is the following: 


This conditional put facility is one of the main features used to implement Read-Modify-Write pattern.

Access to previous state

The ReturnRow argument on the about signature is one of the unique features in Oracle NoSQL database. A return row holds the previous state of a row. A return row can be supplied as an argument to a mutating operation such as putIfVersion(...).
The server would populate the return row with the state of the row before the operation. You can also specify which aspect of previous state be returned: only the previous version or value or both.

READ-MODIFY-WRITE pattern for Database Sequence

A unique, monotonic sequence is modelled with a name (string) and a value (long). We can add more attributes to a sequence, but these two fields are sufficient for current discussion.

To ensure the a sequence generates set of monotonic, unique numbers, the algorithm is to apply READ-MODIFY-WRITE pattern: 

  • READ  a record  by  name with ABSOLUTE read consistency option.
    Because NoSQL maintains multiple copies of data for robustness, a piece of data can be read from any of the copies (termed as Replication Node). Reading with ABSOLUTE consistency ensures that database has confirmed the data to be consistent across all the copies.
    Oracle NoSQL provides different consistency options to read data:  NONE_REQUIRED and ABSOLUTE. You can know about them in more details in the NoSQL documentation.
  • MODIFY update record value by incrementing its current value X by L. This value X denotes the next value of the sequence that has can be used, or in other words, values before X has already been reserved for use.
  • WRITE the value (X+L) back with putIfVersion(), supplying a return row. Again, as in the case of read, write the data COMMIT_SYNC durability warranty that would ensure that data is not only written to durable storage but was synchronized across copies to avoid any other update.

RMW pattern with version-based update

If putIfVersion() succeeds on thread T, it implies that no other thread had updated the record and the range (X, X+L] can be safely reserved for thread T. If another thread had indeed modified the record in the meantime, then the operation will fail (because version would not match), and return row will provide the last updated value X' and version V' of the record. So, thread T can call putIfVersion() again, but this time with X'+L and V'

This simple yet powerful RMW technique using features of Version, ReturnRow and putIfVersion(), it is possible to generate a monotonic sequence that is not only unique to multiple threads but even across different processes.

In code lies the truth...


Now is the time for some code. To summarize the discussion, i have packaged the concepts of  SequenceBuilder, Sequence and RMW lock and a JUnit test for verification in an Eclipse Project.  You can download (~8KB) from   here

Conclusion


This brief article highlights advanced features of NoSQL database to demonstrate how monotonic sequence of values can be generated   that are unique across threads and processes. 

New trillion