m-ld | news

Angus

mcalligator

30-Apr-24

Sustaining Truth across Integrated Applications - converge physical sources into one logical source

In the vast landscape of modern computing, distributed systems have become the backbone of web-scale applications. They offer unparalleled scalability, fault tolerance, and performance, enabling applications to thrive in dynamic environments. However, with great power comes great complexity. Architects and software engineers have to grapple with a whole new slew of challenges that distributed computing throws up - especially when it comes to connecting disparate applications so they share data.

Historically, applications were integrated for reporting purposes. ETL (Extract-Transform-Load) jobs would be run periodically to aggregate the data into a warehouse for query and reporting. This was acceptable at a time when business moved relatively slowly, but is no longer suitable for fast-paced decision-making - especially when that increasingly needs to be automated.

ETL for Reporting

More recently, aggregation for reporting has moved towards using streaming data, with transaction feeds from each application involved populating data warehouses or data lakes in near-real time. This is transforming decision-making by making much more up-to-date information available.

Streaming Data for Reporting

However, there are still drawbacks to that approach. It’s hard to update the source applications with this pooled information, so it can’t easily be round-tripped; the information is read-only; and discrepancies between applications on common data need manual intervention to address, which is time-consuming and error-prone. That means it either doesn’t happen, or imposes unnecessary costs.

This is a common reason why integration of transactional data between applications isn’t implemented, even when they have data in common - e.g. ticket bookings at a music event - and there would be clear benefits to doing so. Implementations are restricted to use cases in which one application remains authoritative for the data in question, with updates being shared via message queues. There’s no provision for changes to the same data in multiple applications, and that’s part of the reason why booking systems lock the provisional record until the entire transaction is confirmed.

Applications that are integrated don’t allow common data to be changed in more than one place at a time, because there are multiple physical sources of truth. Unlike centralised systems in which a single authoritative source dictates the state of the system, distributed systems consist of many nodes, each with its own view of state in the data they hold. This is even more true of formerly standalone applications that have been integrated. When they need to share data that each of them can change, this proliferation of sources of truth ratchets up the potential for conflicts and inconsistencies before changes have had a chance to travel to the other nodes, leading to myriad issues.

Let’s take our ticket-booking example for music events. Imagine a scenario where several users of different applications that manage music events all update elements of the same critical booking record. That record comprises a complex data structure that’s similar in each application. Since this is a distributed environment, those record updates will propagate asynchronously, almost certainly resulting in divergent states across the different applications. Traditionally, a problem like this might have been tackled using standards like XA / Two-Phase Commit at the database level for distributed transactions (although this is exceedingly rare in practice). Database instances remote from the one being updated would be temporarily locked while the transaction was being committed, but when multiple simultaneous transactions happen at the application level, that's extremely difficult to achieve. Without proper coordination mechanisms in place, conflicts are likely, jeopardising data integrity, system reliability, and ultimately, the trustworthiness of the data in those applications.
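To make the divergence concrete, here is a minimal sketch. The `Booking` shape, the `applyUpdate` helper and the sample values are all hypothetical and not taken from any real booking system. Two applications receive the same two updates to one field, but in different orders, and naively overwrite.

```typescript
// Hypothetical booking record shared by two ticketing applications.
interface Booking {
  seat: string;
  holder: string;
}

// A naive update simply overwrites whichever fields it mentions.
type Update = Partial<Booking>;

function applyUpdate(record: Booking, update: Update): Booking {
  return { ...record, ...update };
}

const initial: Booking = { seat: "A1", holder: "Alice" };

// Two users change the same field at about the same time, in different apps.
const fromAppOne: Update = { holder: "Bob" };
const fromAppTwo: Update = { holder: "Claire" };

// Each application applies its own user's update first, and the remote
// update later, when it eventually arrives.
const appOneState = [fromAppOne, fromAppTwo].reduce<Booking>(applyUpdate, initial);
const appTwoState = [fromAppTwo, fromAppOne].reduce<Booking>(applyUpdate, initial);

console.log(appOneState.holder); // "Claire"
console.log(appTwoState.holder); // "Bob"
```

Both applications have seen every update, yet they disagree about who holds the seat, and neither has any signal that a conflict occurred.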

The concept of a single logical source of truth emerges as a promising approach to addressing these challenges. But what do we mean by this? In essence, it’s a local access point for trustworthy information. It’s local to each application using it, even when those applications are geographically separate. Where it's not possible to establish a central authority (like a Distributed Transaction Coordinator) or consensus mechanism (such as Paxos or Raft), distributed systems need a new way to converge on a unified view of complex truth. The approach that m-ld takes enables conflicts either to be resolved automatically, or to be exposed explicitly to client applications for them to determine the best course of action in that context.

Architecturally, this involves retaining the database that each application (or microservice) uses, with the m-ld layer intercepting writes of shared information to that database and using its "remotes" transport to propagate the changes to other integrated applications sharing that information. m-ld uses CRDTs (Conflict-free Replicated Data Types), able to support arbitrarily complex data structures (thanks to the wonders of RDF), to resolve certain conflicts automatically. Where that isn't possible, it passes the data to the application to handle. This can even happen after a period of interrupted network connectivity, if the application adopts m-ld's asynchronous conflict notification. The integrated set of applications now has a single logical source of truth that is decentralised, serving as a beacon of consistency and ensuring coherence across the system.

Architecture for Logical Single Source of Truth
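As a rough illustration of how a CRDT lets replicas converge without a coordinator, the sketch below uses a small observed-remove set, a well-known textbook CRDT. It is not m-ld's algorithm or API; the `OrSet` type, the helper functions and the tags are assumptions made for this example only.

```typescript
// Illustrative only: a small observed-remove set.
// This is NOT m-ld's algorithm or API; all names here are hypothetical.
interface OrSet {
  adds: Map<string, string>; // unique tag -> value
  removes: Set<string>;      // tags of removed entries (tombstones)
}

// Merging is a union of both parts, which is commutative, associative and
// idempotent, so replicas converge whatever order states are exchanged in.
function merge(a: OrSet, b: OrSet): OrSet {
  return {
    adds: new Map<string, string>(Array.from(a.adds).concat(Array.from(b.adds))),
    removes: new Set<string>(Array.from(a.removes).concat(Array.from(b.removes))),
  };
}

// The visible values are those whose tags have not been removed.
function values(s: OrSet): string[] {
  return Array.from(s.adds)
    .filter(([tag]) => !s.removes.has(tag))
    .map(([, value]) => value)
    .sort();
}

// Replace everything this replica has seen with one new value under a fresh tag.
function replaceAll(s: OrSet, tag: string, value: string): OrSet {
  const adds = new Map<string, string>(s.adds);
  const removes = new Set<string>(s.removes);
  Array.from(s.adds.keys()).forEach((seen) => removes.add(seen));
  adds.set(tag, value);
  return { adds, removes };
}

// Both applications start from the same booking holder...
const shared: OrSet = {
  adds: new Map<string, string>([["t0", "Alice"]]),
  removes: new Set<string>(),
};
// ...and each accepts a different concurrent change.
const appOne = replaceAll(shared, "t1", "Bob");
const appTwo = replaceAll(shared, "t2", "Claire");

const mergedAtOne = merge(appOne, appTwo);
const mergedAtTwo = merge(appTwo, appOne);

console.log(values(mergedAtOne)); // [ 'Bob', 'Claire' ]
console.log(values(mergedAtTwo)); // [ 'Bob', 'Claire' ]
```

Both applications end up in the same state. The concurrent change is not silently lost: it surfaces as two values for the holder, which a rule or the application itself can then resolve.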

We do need to be realistic about the challenges of incorporating m-ld into the context of integrated applications: because it's a very powerful mechanism to enable simultaneous distributed changes, there's also more involved than for, say, batch-based ETL. It does require code changes to get the most out of it, but if the use case demands it, the benefits of making that effort are immense.

Distributed computing presents architects and software engineers with a complex tapestry of challenges, chief among them the management of multiple physical sources of truth. A single logical source of truth instead simplifies these challenges considerably, and enables them to harness the benefits of distributed changes to common complex information, whilst maintaining its integrity.

George

gsvarovsky

04-Oct-23

Countdown to Launch 🚀 - m-ld is closing in on its vision

Wow. It's been a long road.

Back in 2020 I plunged into lockdown with a vision. With a plucky new component called m-ld, I was going to drastically reduce the difficulty of building any system that shares live editable information – which is to say, most of them: not just multiplayer documents, but any app that allows an agent to change anything shared at all.

I had no idea how immensely long it would take, just to put something out there that I'm satisfied with. It seemed that every time I took one step forward, I saw ten additional steps to take.

Now, with the help of some truly special people, I'm excited to say those steps don't reach into infinity any more.

We've cracked some of the big tickets (mentioned last time) that we always knew would be essential for adoption, but we never seemed to get to. They're now released with v0.10 of the Javascript engine, which also includes some significant reductions in the getting-started overhead – you can even experiment with m-ld domains and clones right in the documentation.

Why not version 1.0? That's where you come in! A production-grade component is going to take a village – of enthusiasts, collaborators, and users. So far we've travelled primarily on our own conviction. Now we need yours. Can you see what we see?

If you've been here before, first of all thanks for coming back. If you're new, welcome. Over the next few weeks I'll be out & about a lot more than I have been, and I really look forward to meeting you or catching up with you. I'm going to try and express our vision in a few different ways, and I'm going to be asking whether we're on the right track to solve your problems.

If you already have some thoughts, please please do contact us or book a chat on my calendar! In any case thanks for reading and I'll see you out there.

George

gsvarovsky

22-Jun-23

Sustainable Web Apps - m-ld expanding again, redefining web apps, and sharable text

The World Wide Web, a "collaborative medium, a place where we can all meet and read and write" – Sir Tim Berners-Lee, 2005

Can we make it painless and fruitful to build apps that offer live collaboration, linking of data between applications and robust security, without having to lock users in, steal their attention, or take control of their personal data?

What would you build?

This year, we're making it a reality. So far, with m-ld, we've been focused on developing the fundamentals of live information sharing (and we've made lots more progress, see below!). With our new project, Sustainable Web Apps, we're really leaning into the developer experience. We want building live, multi-collaborator experiences in web apps to be fun!

Joining us on this journey is Petra Jaros, a super-powered Staff Engineer specializing in Linked Data, UI/UX, and TypeScript. Welcome Petra! I can't wait to see what we build together.

Of course, we're already working on some amazing ideas. We're going to adapt m-ld's native query language to overlay perfectly onto the way modern web apps are built. Say hello to Reactive Observable Queries!

In the meantime, improvements to m-ld itself keep coming. We're especially excited to announce our answers to these common questions:

Can I build a collaborative text document editor with m-ld, or collaborate on unstructured text as part of my structured data?

It's coming in the JS engine v0.10!

Can I ensure a structured field in my data never becomes an array of conflicting values?

It's already here in the JS engine v0.9.2! It's part of our new experimental support for SHACL, the standard for declarative graph data schemas. Watch this space for more schema features as we work on our vision!

Can I include binary data like images directly in m-ld?

It's coming in the JS engine v0.10!

Can I get started without having to set up a pubsub infrastructure?

We're working on the Gateway, an open-source cloud service that provides a server-based data store and message relay.

In the meantime, we've concluded our Securing Shared Decentralised Live Information project. We did a ton of research into how security (and other critical application characteristics) can be reconciled with fully live information sharing. We're continuing to work on the necessary support in m-ld, and ensuring it's robust for production use.

We look forward to telling you more about our projects in the coming months. As usual we'd love to hear about your projects too! You can contact us or just book a chat on my calendar.

Angus

mcalligator

06-Sep-22

News - Bumper Edition - m-ld expanding, public event, federating systems, and security!

There’s been a bit of a gap since the last m-ld news update, but that changes as of today! First up:

The m-ld team expands

In case you weren’t already aware, Angus McAllister has now joined m-ld full-time as Head of Product and Community, after 6 years at Amazon Web Services (and several other multinationals before then). This was actually all the way back in April, and we meant to announce it then, but we’ve been super-busy on other stuff, lots of which we can tell you about, and some of which you’ll have to wait a bit longer to read. My focus is on working with customers (developers, software architects, technical product managers, etc.) to define and refine our products so they meet their needs optimally. And to do that, I’ve begun building a vibrant community around them. Hopefully that’s what’s brought you here today to read this!

Workshop - The Power of Live Linked Data to Transform

One of the first things I did upon arrival was to organise an in-person workshop with this title. 13 of us from a broad range of organisations got together in Utrecht, Netherlands in May 2022 to talk through a most interesting set of topics relating to Live Data and Linked Data, and how combining these gives rise to some powerful solutions to long-standing technology challenges, especially in the area of creating decentralised software. We recorded the sessions - check them out on our YouTube channel:

Federating Sovereign Systems using Live Linked Data

On that theme, NLnet Assure awarded a further grant to a consortium of 4 organisations (ourselves, Ponder Source, Evoludata, and Muze) to build a federation of Time-Tracking systems as a precursor to the more general case of federating Bookkeeping systems. We have been collaborating closely with them, and are very close to completing this successfully. The project has highlighted nuanced differences between data federation and integration, the semantics of metadata in each system, management of both identities and identifiers between them, and a host of other things. There’s a great deal more to explore to build on this, which we hope to do in a follow-on project. Ultimately, we’re working towards making it possible for all information to be live, irrespective of the application being used to work with it, and this is an important step along that journey. You can find out more about the project here.

Securing Shared Decentralised Live Information

You might recognise this from an earlier news update. What’s happened with this, you might be wondering. Well, we spotted the chance to build an exemplar app for this in the Federated Timesheets project, and have woven the two projects together. More details to come!

Other stuff

We can’t say much more about this just yet, but there are some exciting developments bubbling up of a more commercial nature, and as soon as we can go public on these, we will. Hold tight in the meantime.

George

gsvarovsky

17-Jan-22

Javascript Engine v0.8 - Latest release with usability, performance & extensibility

The next version of the m-ld Javascript Engine has been published (on npm)! We've been working hard on usability and performance, and we've also included the latest prototype work from the Security Project (we believe in continuous integration).

We're moving fast towards our vision of data-wrangling-free information sharing. The security work is key, not only because it boosts, um, security 😜, but also because it signposts the way to decentralised extensibility.

What's that now? We want anyone to be able to create new information sharing behaviours and rules, which are specific to their apps but also work well together – security rules are just one example. To us, these behaviours are also shared information. So when you use them in a domain, they will appear in the domain data itself (this is often called 'metadata' – data about data). One superpower this gives is to be able to dynamically adapt the presentation of information as its rules evolve. But more than that, it also gives users more freedom to change which app they use, while keeping their data intact.

We'll be writing a lot more on this topic soon. In the meantime, please do get in touch with your ideas and use-cases!

Here's a run-down of the headlines in the new release.

Query usability improvements: more intuitive delete-where & vocabulary references.
API utility wrappers, including property casting and an extended get method able to select individual properties.
Engine performance improvements, including faster @describe queries on larger datasets.
API support for back-pressuring read results. This better supports asynchronous results consumers, such as agents that update remote data sinks like databases.
API support for native RDFJS Dataset Source and queries using SPARQL algebra, for projects which use RDF natively.
Experimental support for whole-domain read/write access control, with users registered in the domain data.

George

gsvarovsky

09-Sep-21

Realtime Information Sharing with RDF - At the leading European conference on Semantic Technologies and AI

This week we were proud to present a posters & demos track paper about m-ld at SEMANTiCS!

We don't talk much here on the website about our internal data representation, RDF, because unless you're building an app that also works with RDF, it's easier to just treat m-ld data as JSON – we're committed to making shared data as easy as possible for app developers without adding yet another technology for you to learn. But the choice of RDF will really pay off in the long run.

Why? Because the vision of RDF is not only to bring developers together by sharing a format, but also to bring data together, by sharing meaning ("semantics"). With RDF the data you work on today, in your app, has already begun to be ready to be integrated with data from other apps – when (not if) that becomes a requirement.

This vision is more important today than ever, and the talks from our colleagues at the conference showed this with real use-cases (sometimes blowing our minds in the process): from answering questions about the COVID-19 response, to applying machine-readable semantics to legal documents (something we have our own ideas about!).

Still, RDF needs thought when it comes to actually sharing information live for realtime collaboration. That's where we come in!

If you'd like to know more, download the paper (also from the conference proceedings), or if you only have time for one coffee, try the poster.

George

gsvarovsky

14-Jul-21

July 2021 Update - vision sharing, a new release & projects underway

Hi folks and welcome to another bumper m-ld update.

Sharing our Vision

NLnet

Recently we shared our vision for Live Shared Linked Data at the NLnet Next Generation Internet webinar on Linked Data. You can read the presentation at your own pace; and we'd love to hear from you if it resonates!

We'll also be presenting a posters & demos track paper about m-ld at SEMANTiCS, the leading European conference on Semantic Technologies and AI. Will we see you there?

Web Starter Project

We've added a new starter project for web apps wanting to share information in real-time. It should help with getting to grips with the best patterns for incorporating m-ld into apps.

For this use-case, we've added a new way for clones to get in touch with each other. Socket.io is a popular library for real-time updates in web apps, and the Javascript engine's new plug-in makes it easy to enable real-time information collaboration. But also, by doing this we showed just how little code is needed to support a new messaging protocol. So, don't worry if your preferred choice is not on the list yet, we can make it happen!

Javascript Engine v0.7

Yesterday we dropped another release of the Javascript engine, which includes some important internal advances. It's live in the demo right now!

The biggest change is protocol support for journal compaction. This will help reduce storage costs, by allowing clones to compact and truncate their journals, as best suits the platform and the app. A 'balanced' journaling strategy with some simple options is the default for the Javascript engine, and we'll expose more options and APIs over time, as we learn.

Securing Shared Decentralised Live Information

As we announced last time, we've begun a project to research and prototype modifications to the primitives of the m-ld core protocol to support strong assurance of data integrity and traceability. You can now check out the project introduction and repository, and see how we're going about it.

The first step is a deep dive into security threats in a couple of domains, to make sure our security designs meet the needs of real-world scenarios; and the first of these is well under way...

Collaborative e-Invoice Composition is an early-stage collaboration project between m-ld and Ponder Source. Join us on Gitter to discuss ideas to transform procurement processes for the better!

George

gsvarovsky

07-May-21

Securing Shared Decentralised Live Information - a grant funded project with NLnet

NLnet

We're proud to announce that NLnet have chosen to commit grant funding to us for security research, under the NGI Assure programme!

The project, Securing Shared Decentralised Live Information with m-ld, will research and prototype modifications to the primitives of the m-ld core protocol to natively support strong assurance of data integrity and traceability, with authority assignable to identified actors, such as users or groups.

NLnet have said: Collaborative software does not currently keep track of authenticity information on the edits by the actors with cryptographic strength. Yet this is a requirement when parties without a mutually trusted platform want to collaborate on a project. Developing a protocol and software for this situation can bring about a big step for collaborating on knowledge graphs and other documents.

We're really looking forward to working towards an open information society in collaboration with NLnet and its partners.

NGI Assure is made possible with financial support from the European Commission's Next Generation Internet programme.

Check out the project on GitHub. As always, your feedback is most welcome! Do you have a use-case we can talk about?

George

gsvarovsky

26-Apr-21

New drop of the Javascript engine - WebRTC, json-rql, performance and more.

We're pleased to announce version 0.6 of the Javascript engine, which has a number of key improvements, paving the way to our Beta release.

We've added experimental support for WebRTC peer-to-peer communications. Since WebRTC needs the help of another service to connect, this is not a stand-alone remotes object but an enhancement to the Ably remotes. However it's been developed in its own package so it can be used to augment any other remotes implementation in future. We feel like we've learned enough about WebRTC to fill a book, so do get in touch if you're curious about this tech!

We've also added more support for json-rql query patterns, allowing more complex declarative queries and reducing the amount of data manipulation that apps need to do after retrieving data. In the headlines:

You can use @construct to get JSON structures which are the right shape for your UI, or for any other purpose.
You can @filter with Constraints to apply operators and reduce the amount of data you work with.

In the meantime, the support for Lists has now moved out of experimental status, as its API has been fully defined and it's surviving our compliance testing regime!

Part of that work also included some performance work on the core of the engine. In particular, we've removed some unnecessary asynchronous deferrals that came with a third-party dependency, meaning query parsing and data streaming can be 10x faster in a web browser! We're not done with performance yet though, so expect more improvements to come.

Many thanks to our friends at Beautiful Interactions for their ongoing work on Quadstore, which the Javascript engine relies on behind the scenes.

George

gsvarovsky

29-Jan-21

Truth and Just Lists - multi-collaborator editable Lists in m-ld's JSON-LD interface.

In The Data Æther I envisioned a future abstraction for data, X, in which its syntax is invisible, its semantics are always visible, and we always know how close to the truth it is. So we no longer have to wrangle data between software components, and we can get on with adding value and solving real problems.

Let me tell you about where this vision has led me, personally. I’m a practical type, and I’m only comfortable selling hopes and dreams if I can show that important parts of the dream work in real life.

So I’ve been building m‑ld, which is like X in microcosm. It doesn’t try to change everything about software architectures, all at once. Instead it’s a practical step in the right direction, which delivers some new value.

That new value is all about live information sharing among collaborating actors, such as app users, software layers, or autonomous robots. m‑ld reduces the cost and complexity of coding information sharing, allowing the faster delivery of apps, information services and control systems. Think real-time editing features, like Google docs, but for any structured information.

So as well as making some choices that suggest a direction for X, m‑ld is deliberately targeted at one corner of the truth axis, the hardest one to get right, which is having some information be writable for multiple actors at the same time. It’s not the first or the only library for that, but by anticipating X it has a strong commitment to a data representation with inherent extensibility to new data structures and rules, and also new truth claims. It also tries to be as ergonomic as possible, and come with everything you need to get going.

Just recently I’ve put these claims to the test, by extending m‑ld with a new data structure. As of today, m‑ld natively supports Lists.

Sarcasm, from a monkey. Photo by Jared Rice on Unsplash

I know… Lists!

Yes indeed, the data structure that’s used by every program that ever goes beyond “Hello, World”. But bear with me. Here are some other technologies that don’t have Lists as a native construct:

Relational Databases
Resource Description Framework (RDF)

The latter is particularly relevant here, as we’ll see in a moment. But relational databases, and good old SQL? The foundation stone on which some 60% of data applications are built, doesn’t have Lists?

Yes — and deliberately so. In designing the relational model, one of Edgar F. Codd’s goals was data independence, the ability to describe data without dependence on how it is serialised in storage.

… in recently developed information systems… the model of data with which users interact is still cluttered with representational properties, particularly in regard to the representation of collections of data

He singled out ordering dependence, because of its tendency to be driven by serialisations implicitly ordered by sequential addresses (like arrays) or pointers (like linked lists).

His solution was to insist that order of presentation is not implicit, but driven by some order-able component of the data itself. So in his model, Relations (e.g. tables) are strictly unordered Sets of rows and columns. Fast-forward to the twenty-first century and we find the database world littered with ORDER columns containing integer list indexes.

‘Experience’ versus design, applied to Lists in a relational model (original source unknown)

So Codd’s concept of data independence, applied to Lists, essentially means that the ordering of the list must be explicit in the data. But if so, an application needs a way to enforce the correctness of that ordering, for example:

Lists are either empty, or have a first element and a last element
The position of any list item is unambiguous
Lists do not have gaps, cycles, or branches

These kinds of rules fall into the domain of data consistency, which Codd initially accounted for with constraints, and which was later augmented by the work of Andreas Reuter and others in consideration of the ACID properties of database transactions.

Consistency. A transaction reaching its normal end (EOT, end of transaction), thereby committing its results, preserves the consistency of the database. In other words, each successful transaction by definition commits only legal results…

As Pat Helland points out, this property allows for “a more cohesive semantic enforced by an application”. In other words, an application and the database collaborate on what constitutes a “consistent” state.
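As an illustration of the application's side of that collaboration, the list rules above can be checked against rows that carry an explicit position, in the manner of an ORDER column. This is a sketch only: the `ListRow` shape and the `checkOrdering` function are hypothetical. Integer positions rule out cycles by construction, and two items sharing one position play the role of a branch.

```typescript
// Hypothetical row shape: the list position is explicit in the data,
// as with an ORDER column in a relational table.
interface ListRow {
  item: string;
  position: number; // zero-based index within the list
}

// Returns the rule violations found; an empty result means the ordering is valid.
function checkOrdering(rows: ListRow[]): string[] {
  const problems: string[] = [];
  const seen = new Set<number>();
  for (const row of rows) {
    if (!Number.isInteger(row.position) || row.position < 0) {
      problems.push(`invalid position ${row.position} for "${row.item}"`);
    } else if (seen.has(row.position)) {
      // Two items at one position: the ordering is ambiguous.
      problems.push(`duplicate position ${row.position}`);
    }
    seen.add(row.position);
  }
  // With n rows, "no gaps" means every position from 0 to n-1 is taken.
  for (let i = 0; i < rows.length; i++) {
    if (!seen.has(i)) problems.push(`gap at position ${i}`);
  }
  return problems;
}

const valid = checkOrdering([
  { item: "Bread", position: 0 },
  { item: "Milk", position: 1 },
]);
const gappy = checkOrdering([
  { item: "Bread", position: 0 },
  { item: "Milk", position: 2 },
]);

console.log(valid); // []
console.log(gappy); // [ 'gap at position 1' ]
```

An empty list passes trivially, and a non-empty valid list has a first element at position 0 and a last element at position n-1.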

So, despite sowing confusion, this model has undoubtedly found considerable success. Why does m‑ld need Lists then?

JSON-LD and Why…

m‑ld’s foundational data model is not the Relational Model but the Resource Description Framework. RDF has even stronger data independence and an even simpler data structure: in RDF, a whole database is just a Set of triples: each a statement with three parts, subject, predicate and object.

This confers a number of advantages, but the main one is that data structures, all the way from sets to tables, to lists, to a robot’s worldview, can be layered on top in a consistent, seamless and well-defined way. This is critical for m‑ld, which offers a different model of concurrency control, broadly categorised as eventual consistency, which affects behaviour in all these structures and layers. We’ll go into this a lot more.

But at the same time, we didn’t want to have to invent a new data representation to achieve this. RDF is well-defined, has a formal query language (SPARQL), a number of serialisations, and library mappings to various programming languages.

That’s the positive side. However, while RDF has its fans, I’ll let Manu Sporny, the inventor of JSON-LD, exemplify its detractors and provide a convenient segue:

RDF is a shitty data model. It doesn’t have native support for lists. LISTS for fuck’s sake! The key data structure that’s used by almost every programmer on this planet and RDF starts out by giving developers a big fat middle finger in that area... For all the “RDF data model is elegant” arguments we’ve seen over the past decade, there are just as many reasons to kick it to the curb. This is exactly what we did when we created JSON-LD, and that really pissed off a number of people that had been working on RDF for over a decade.

For all the vitriol, I’ll hazard the suggestion that JSON-LD is an important boon to RDF. It’s worth reading the linked article for the motivation, but with JSON-LD, Manu et al. created a serialisation syntax for RDF that bridges a gap from academia to the real, dirty practices of software developers out in the world, solving problems.

So given that opinion, you will not be surprised to learn that m‑ld’s interface supports JSON-LD as its (currently, only) data serialisation syntax.

What Dreams May Come

So, we started this endeavour with a conundrum: RDF doesn’t natively have Lists, any more than the Relational Model does, and there is a good reason, in data independence, for this to be so. But Lists are so ubiquitous, and so valued, it’s hard to even imagine a new data management component that lacks them as a first-class citizen.

The escape hatch that we chose is to treat API and implementation separately:

m‑ld’s API has native Lists, just like JSON-LD does.
m‑ld implements Lists as an extension to its core data model.

By implementing Lists we have ‘eaten our own dog food’ and exercised the same inherent extensibility that we offer to apps that use m‑ld. Lists have no free pass to do anything that an app data structure cannot, except for having some syntactic sugar in the API.

And boy, did we eat a lot of dog food.

“Whatever will I do with you.” Photo by Brooke Cagle on Unsplash

There were two hard problems that we faced.

JSON-LD has Lists, but it doesn’t have a way to update them. We had already extended JSON-LD with our query language, json-rql, which includes an update syntax derived from SPARQL. To accommodate list updates at specific indexes, though, we had to slightly break the JSON-LD @list syntax, while staying true to its spirit.
RDF Containers and Collections can’t cope with concurrent editing. “Wait, you said RDF doesn’t have Lists!” Not in its core; but it does define some terms which capture some of the semantics of collections. Unfortunately, the terms define structures that are very brittle in the face of concurrent edits, and so, unusable in m‑ld.

Let’s dive in.

List Updates

By default in JSON-LD, array values are unordered sets. This is startling, considering that here is a technology deliberately created to look and feel like JSON — and the first thing we mention about it, is a departure from one of JSON’s core semantics (and there aren’t many of those to depart from).

But this is actually a great example of JSON-LD’s core pragmatism. JSON-LD is a graph data model, and a serialisation format for RDF, in which set semantics is more fundamental than array semantics. So the designers were between a rock and a hard place: either they break JSON a bit, or they break with JSON, invent their own syntax, and throw away the default support of thousands of tools and libraries.

To mitigate this, with an out-of-band annotation you can turn array semantics back on for fields that are definitely lists and not sets, or you can do the same in-band by way of a keyword, @list, which interposes between the property and the array, like this:

{
  "@id": "reminders",
  "phone": ["Alice", "Bob"],
  "shopping": { "@list": ["Bread", "Milk"] }
}

In this example I’m reminded to phone both Alice and Bob, without any prejudice as to which one comes first or their relative priority. For the shopping, the @list keyword tells me that the order of Bread and Milk matters (rather mysteriously: why it matters is not specified).

The ordering of a List in JSON-LD will survive the various canonical transformations that a JSON-LD processor supports, including translation to RDF (of which more in a moment). But that’s it, really. There’s no syntax for adding new items or deleting existing ones; updates are not part of JSON-LD.

Enter json-rql, a superset of JSON-LD, designed for just that. The above JSON-LD example is already json-rql, describing an initial insert of graph data into some store. But now I can update my phone reminders with an update ‘pattern’:

{
  "@delete": {
    "@id": "reminders", "phone": "Bob"
  },
  "@insert": {
    "@id": "reminders", "phone": "Claire"
  }
}

This means I don’t have to phone Bob any more, but now I do need to phone Claire. The shopping is unaffected, because @delete and @insert represent patches to the existing data. Note also that "Bob" and "Claire" don’t need to be in square brackets because in the graph data model, a value is equivalent to a singleton set.

That last point, that a value is equivalent to a singleton set, is a subtlety that can really bite. What does this mean:

{
  "@id": "reminders",
  "shopping": [
    { "@list": ["Bread", "Milk"] },
    { "@list": ["Pink Wafers", "Spam"] }
  ]
}

Yes, you guessed it, it’s a Set of two shopping Lists. You can try it on the JSON-LD playground; it’s completely valid.

But, if I were to try and @insert Angel Delight into the shopping, which list would it change..?

So here’s one way that json-rql parts ways from JSON-LD, just a little, within its ‘superset’ remit. In json-rql, Lists are promoted to full-blown Subject nodes, and can therefore have an @id:

{
  "@id": "reminders",
  "shopping": [
    { "@id": "buy", "@list": ["Bread", "Milk"] },
    { "@id": "avoid", "@list": ["Pink Wafers", "Spam"] }
  ]
}

If you don’t provide an @id, m‑ld will generate one for you, which will be visible when you do a query.

We can now uniquely identify the list we want to update, and so without further ado, here is the syntax for removing Bread from the “buy” list and adding Cheese at the beginning:

{
  "@delete": {
    "@id": "buy", "@list": { "?i": "Bread" }
  },
  "@insert": {
    "@id": "buy", "@list": { "0": "Cheese" }
  }
}

In the interests of brevity I just hit you with a few things at once; let’s unpack them.

List updates can use an indexed-object syntax instead of an array for the @list key. (If you’re familiar with Javascript this may not be surprising.) An index key must either be a variable, or parseable as a non-negative integer.
For the @insert, we specify the (zero-based) index at which we want Cheese to appear. A JSON key cannot be a number, so we wrap it up in quotes. (In Javascript, you can use a plain number.)
For the @delete, we don’t care which position Bread is in, so the index is a variable i (we could equally use an anonymous variable). If Bread appeared more than once in the list, this would delete all occurrences.
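To make those rules concrete, here is a minimal Python sketch of how an indexed-object update could be applied to a plain in-memory list. It is an illustration only, not m‑ld's implementation: the function name apply_list_update is hypothetical, and it assumes that delete keys are matched against the list as it was before the update and that inserts are applied afterwards, lowest index first.

```python
def apply_list_update(items, delete=None, insert=None):
    """Apply an indexed-object style update to a plain Python list.

    `delete` and `insert` map index keys to items. A delete key is either
    a variable (a string starting with "?") or a non-negative integer in
    string form. Insert keys must be integers in string form.
    """
    doomed = set()
    for key, item in (delete or {}).items():
        for index, existing in enumerate(items):
            # A variable key matches the item at any index;
            # a numeric key matches only at that exact index.
            if existing == item and (key.startswith("?") or int(key) == index):
                doomed.add(index)
    result = [x for i, x in enumerate(items) if i not in doomed]
    # Assumption: inserts happen after deletes, in ascending index order
    for key, item in sorted((insert or {}).items(), key=lambda kv: int(kv[0])):
        result.insert(int(key), item)
    return result


buy = apply_list_update(
    ["Bread", "Milk"],
    delete={"?i": "Bread"},
    insert={"0": "Cheese"},
)
print(buy)  # ['Cheese', 'Milk']
```

Because the delete key is a variable, a list containing Bread twice would lose both occurrences, as described above.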

This indexed-object syntax also allows us to perform pattern matching against the list by index or item, or both. If I want to know which list Spam is in, and with what priority:

{
  "@select": ["?list", "?spamPriority"],
  "@where": {
    "@id": "reminders",
    "shopping": {
      "@id": "?list",
      "@list": { "?spamPriority": "Spam" }
    }
  }
}

This returns [{ "?list": { "@id": "avoid" }, "?spamPriority": 1 }].
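If it helps to see the matching spelled out, the sketch below reproduces that result in plain Python over the two-list example. The function match_list_item and its parameters are hypothetical names for illustration; a real json-rql engine matches patterns against the graph rather than walking JSON like this.

```python
def match_list_item(subject, prop, item, list_var="?list", index_var="?index"):
    """Return one binding per (list, index) at which `item` occurs."""
    values = subject[prop]
    if not isinstance(values, list):
        values = [values]  # a single value is equivalent to a singleton set
    bindings = []
    for node in values:
        for index, member in enumerate(node["@list"]):
            if member == item:
                bindings.append({list_var: {"@id": node["@id"]}, index_var: index})
    return bindings


reminders = {
    "@id": "reminders",
    "shopping": [
        {"@id": "buy", "@list": ["Bread", "Milk"]},
        {"@id": "avoid", "@list": ["Pink Wafers", "Spam"]},
    ],
}

print(match_list_item(reminders, "shopping", "Spam", index_var="?spamPriority"))
# [{'?list': {'@id': 'avoid'}, '?spamPriority': 1}]
```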

You can try out this Lists API in m-ld using the web-based playground. Here is a link to the example. (Once the domain is connected, just click apply in the Transact pane to insert the list.)

So now we can express list operations in the json-rql API. How does this translate to the RDF graph?

RDF List Representation

A JSON-LD processor, when asked to produce RDF, will convert a @list field into an RDF Collection, which is a pattern for encoding the list items into the graph without losing the ordering. Here is our shopping list in N-Triples format:

<http://ex.org/reminders>
  <http://ex.org/#shopping> _:b0 .
_:b0
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#first>
    "Bread" .
_:b0
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest>
    _:b1 .
_:b1
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#first>
    "Milk" .
_:b1
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .

Let’s trim this down with a simplified notation:

reminders shopping o .
o first "Bread"; rest p .
p first "Milk"; rest nil .

Here, o and p are internal identifiers (note that the _:b0 and _:b1 above do not capture list index numbers, just the order in which the identifiers were generated).

Even if you’re not familiar with RDF and the syntax, hopefully you can see that the general pattern is a linked list. If you’re interested in the details, Ontola has written a nice article about this and other RDF options.
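For readers who prefer code to triples, here is a small Python sketch of the same linked-list pattern, using the simplified first/rest/nil names from above. The helper names are made up for this illustration, and the blank node labels are generated the same way as in the N-Triples example (in creation order).

```python
NIL = "nil"


def to_rdf_collection(items):
    """Encode items as (subject, predicate, object) triples forming a linked list.

    Returns the head node and the list of triples.
    """
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(nodes) else NIL
        triples.append((nodes[i], "first", item))
        triples.append((nodes[i], "rest", rest))
    head = nodes[0] if nodes else NIL
    return head, triples


def from_rdf_collection(head, triples):
    """Walk the first/rest pointers from the head to recover the ordered items."""
    first = {s: o for s, p, o in triples if p == "first"}
    rest = {s: o for s, p, o in triples if p == "rest"}
    items = []
    node = head
    while node != NIL:
        items.append(first[node])
        node = rest[node]
    return items


head, triples = to_rdf_collection(["Bread", "Milk"])
print(from_rdf_collection(head, triples))  # ['Bread', 'Milk']
```

Note that reading the list back relies on every node having exactly one rest pointer, which is the property that makes the structure fragile.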

I’ll mention, in passing, that even though it can be done, this arrangement is a megalithic PITA to query with SPARQL, RDF’s query language.

But as a linked list, this structure also requires care when editing, to keep the list valid. In a programming language you’d encapsulate the delete and insert operations into functions, and preferably hide away the pointers so consumers of your list don’t accidentally make a mess of it. In an app, you can do the same for consumers of the RDF data.

A bigger problem is what happens when this structure is edited by multiple actors concurrently, which is a core requirement for m‑ld.

As we’ve noted, m‑ld uses a Conflict-free Replicated Data Type (CRDT) to ensure that every clone ends up with the same data. But using the plain RDF Collections pattern, concurrent edits generate ‘lists’ with an amusing cornucopia of empty positions, loops, gaps, and branches, even if all the edits were valid by themselves. Distributed systems veterans will not be surprised by this. But how to get around it?
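To make the failure concrete, here is a Python sketch of two clones each making a perfectly valid insertion after Bread at the same time. The merge shown is a deliberately naive one, in which each edit is a set of triples to delete and a set to insert; that is an assumption for the illustration and not a description of m‑ld's actual CRDT.

```python
# The shopping list in the simplified notation: o -> p -> nil
base = {
    ("o", "first", "Bread"), ("o", "rest", "p"),
    ("p", "first", "Milk"), ("p", "rest", "nil"),
}


def insert_after(node, new_node, item, graph):
    """A valid single-clone edit: splice new_node into the list after node.

    Returns the edit as a (deletes, inserts) pair of triple sets.
    """
    old_rest = next(o for s, p, o in graph if s == node and p == "rest")
    deletes = {(node, "rest", old_rest)}
    inserts = {
        (node, "rest", new_node),
        (new_node, "first", item),
        (new_node, "rest", old_rest),
    }
    return deletes, inserts


def apply_edits(graph, edits):
    """Apply each edit as plain set operations (a naive stand-in for the CRDT)."""
    for deletes, inserts in edits:
        graph = (graph - deletes) | inserts
    return graph


# Both clones start from the same base and edit concurrently
edit_a = insert_after("o", "a", "Eggs", base)
edit_b = insert_after("o", "b", "Jam", base)

merged_ab = apply_edits(base, [edit_a, edit_b])
merged_ba = apply_edits(base, [edit_b, edit_a])
rests_of_o = sorted(o for s, p, o in merged_ab if s == "o" and p == "rest")

print(merged_ab == merged_ba)  # True
print(rests_of_o)  # ['a', 'b']
```

The clones do converge, but on a node with two rest pointers: a branch, which is no longer a list.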

Constraints

There are a number of existing languages for declaring the structure and rules of RDF datasets, including SHACL, RDFS and OWL, which provide a powerful toolset for information and knowledge engineering. It is very much the intention that users of m‑ld will have the option to use such tools.

However, these do not generally consider the particular demands of concurrent editing. So, for now we’ve been inspired by them but taken our own path with a lightweight way of declaring constraints (similar in spirit to Codd’s, hence the name).

A ‘constraint’ is a semantic rule that describes not only invariants about the data, but also encapsulates update rewriting, conflict resolution and entailments, as we’ll see. If that makes them sound like a bit of a sonic screwdriver, that’s pretty accurate.

“Scanner, diagnostics, tin opener!” Everything an adventurer needs for the incredible journey ahead!

In fact ‘constraints’ are an API that permits apps to define their own semantic rules in code. It’s still very much an advanced function, because of the need for the implementer to consider concurrency, unlike the normal app API.

Is it possible to use constraints to fix the concurrent behaviour of RDF Collections? The simple answer is: we didn’t even try. (Although, if you feel like giving it a go, let us know.)

Enter LSEQ

The reason we very quickly elected to drop RDF Collections as a representation for Lists is that they require post-hoc resolutions to conflicting updates. This is supported by constraints; but the cost is that each resolution appears in the history as another transaction (the original transactions that generated the conflict already being committed). RDF Collections are so dramatically prone to conflicts that the marginal comfort of using an existing pattern just doesn’t balance the costs.

And of course we’re committed to data independence, so we can readily use a different representation if that gives us better mileage.

So, is there a way of representing Lists that doesn’t generate conflicts? As mentioned, there are a number of sequence CRDTs (recall that the C stands for Conflict-free) that have been characterised. Can they be overlaid on the RDF data model?

Yes. Here is our shopping list, once again:

<http://ex.org/reminders>
  <http://ex.org/#shopping>
    <http://ex.org/buy> .
<http://ex.org/buy>
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://m-ld.org/RdfLseq> .
<http://ex.org/buy>
  <http://m-ld.org/RdfLseq/?=pmkHiW2fz54xWz98Q>
    "Bread" .
<http://ex.org/buy>
  <http://m-ld.org/RdfLseq/?=qmkHiW2fz54xWz98Q>
    "Milk" .

And in abbreviated format (note, m‑ld’s network and storage representations are closer to the below than the above):

reminders shopping buy .
buy type RdfLseq ;
    RdfLseq/?=pmkHiW2fz54xWz98Q "Bread" ;
    RdfLseq/?=qmkHiW2fz54xWz98Q "Milk" .

The URI http://m-ld.org/RdfLseq is a class name and namespace for a List CRDT based on LSEQ. We mark the list as belonging to this class to allow for future list CRDTs doing something different.

The characteristic of LSEQ visible here is that it uses an order-able position identifier for each item in the list. Yes, it’s an ORDER column! But the way the position identifiers are generated guarantees that:

Position identifiers are unique to a clone;
Position identifiers do not change; and
It’s always possible to position a new item in between two existing ones.

This means that clones can generate new position identifiers independently of each other, and maintain a global ordering in which the user intentions are always respected, without coordination or conflict.
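The three guarantees can be sketched in a few lines of Python. To be clear about the assumptions: this is a simplified scheme in the general style of sequence CRDT identifiers, not LSEQ's actual allocation strategy and not m‑ld's encoding (the identifiers above are opaque strings). Here a position is a tuple of (digit, clone id) pairs, compared lexicographically, and the names BASE, BEGIN, END and between are invented for the example.

```python
BASE = 32            # digits 1..BASE-1 are usable at each depth
BEGIN = ()           # sorts before every real position
END = ((BASE, ""),)  # sorts after every real position


def between(lo, hi, clone_id):
    """Return a new position p with lo < p < hi, tagged with clone_id."""
    assert lo < hi
    path = []
    bounded = True  # does hi still constrain the digits we may choose?
    depth = 0
    while True:
        lo_digit = lo[depth][0] if depth < len(lo) else 0
        hi_digit = hi[depth][0] if bounded and depth < len(hi) else BASE
        if hi_digit - lo_digit > 1:
            # There is a free digit here: take the midpoint and stop
            path.append(((lo_digit + hi_digit) // 2, clone_id))
            return tuple(path)
        # No free digit at this depth: follow lo (or a filler) one level down
        step = lo[depth] if depth < len(lo) else (0, "")
        bounded = bounded and depth < len(hi) and step == hi[depth]
        path.append(step)
        depth += 1


bread = between(BEGIN, END, "alice")
milk = between(bread, END, "alice")
# Two clones insert between Bread and Milk at the same time
eggs = between(bread, milk, "alice")
jam = between(bread, milk, "bob")
# A third clone can still squeeze an item in between those two
tea = between(eggs, jam, "carol")

shopping = {bread: "Bread", milk: "Milk", eggs: "Eggs", jam: "Jam", tea: "Tea"}
print([shopping[p] for p in sorted(shopping)])
# ['Bread', 'Eggs', 'Tea', 'Jam', 'Milk']
```

The clone id in the final pair keeps concurrently generated positions distinct, existing positions are never rewritten, and when the digits run out at one depth the new position simply goes one level deeper.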

(There’s an additional indirection involved in the actual representation to handle moving items in the list and maintaining numeric item indexes, via list slots. We’ll go into that another time. For a preview, have a look at Martin Kleppmann’s paper on moving elements in list CRDTs.)

Key to the way we implemented this is that there is no RdfLseq-specific code in the core of m‑ld. It’s all done in a constraint:

Rewriting json-rql @list updates to RdfLseq position updates
Entailment of numeric item indexes for querying
Resolution of the one kind of conflict that can still rarely occur: duplication of a list slot
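As a rough illustration of the second item, numeric indexes can be derived from the position identifiers simply by sorting them. The Python sketch below uses the two identifiers from the example above and assumes that they order by plain string comparison; that ordering rule and the function name entail_indexes are assumptions for the illustration, and list slots are ignored.

```python
def entail_indexes(position_to_item):
    """Derive numeric item indexes by sorting the position identifiers."""
    ordered = sorted(position_to_item)
    return {index: position_to_item[position] for index, position in enumerate(ordered)}


# Deliberately stored out of order: the positions alone determine the indexes
buy = {
    "qmkHiW2fz54xWz98Q": "Milk",
    "pmkHiW2fz54xWz98Q": "Bread",
}

print(entail_indexes(buy))  # {0: 'Bread', 1: 'Milk'}
```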

At the moment the prototype DefaultList constraint is available and active in the Javascript clone engine, and the constraint features are documented there. Of course, they will also have a section in the full m‑ld protocol specification as it beds down, and we work towards our Beta release.

Conclusion

In this article we discussed the theoretical basis for Lists not to be included among the primitives of a representation that seeks data independence; but their importance in practice deserves recognition.

So in m‑ld, Lists are an interface primitive but not an implementation primitive, which has led to the expansion and clarification of the extension API for data constraints.

In order to achieve an efficient convergent list data type in RDF, the foundational data representation used in m‑ld, we chose to bypass the RDF Collection terms almost entirely, and to slightly modify the JSON-LD List representation. We hope to engage more with the communities around these technologies to inform future developments.

If you’re interested in live information sharing with m‑ld, do get in touch! We’d love to hear about your projects.

George

gsvarovsky

14-Dec-20

The Data Æther - Data isn't really like matter. It's like space.

Let's talk about data.

Data is intrinsic to software. It's the inputs and the outputs. It flows in through interfaces, being checked and piped and deconstructed to fine granules, which find their way into and through the tiniest of subroutines. Some of it is discarded and deleted or garbage collected, but some will be transformed and routed and merged into new wholes, and delivered.

Read our article on codeburst.io for our vision of the future of data.

George

gsvarovsky

30-Nov-20

Collaboratenable your app! - Some of the magic behind m-ld

m-ld is built on cool computer science, which is changing the way we think about data. Come with us on a journey into CRDTs with our handy infographic.

This month we've released preview version 0.4 of the Javascript engine, which brings improvements to the API, as well as performance and scalability work behind the scenes.

What app would you like to collaboratenable? What would be the main benefits of enabling this software with a live collaboration feature?

George

gsvarovsky

05-Oct-20

The Playground is Available - A safe place to try out the m-ld API

As we continue our work on the features of m-ld, we've found that you (and we too) often want to quickly try out the query and transaction syntax against a temporary domain.

Now it's easy! The playground is designed just for that. It's all explained once you're there, and there's even an introductory video. We've also added some more examples to the Javascript engine transaction documentation, to help you find your way into the language.

Many thanks to everyone who has contacted us about use-cases in specialist domains. We're always open to having a chat, so get in touch by email, use the feedback form, or drop us a note on Twitter or LinkedIn.

Have a great Autumn!

George

gsvarovsky

01-Sep-20

Manifesto for Data - We believe data should be live and sharable by default.

Also published on codeburst.io.

The 'truth' should be the data that is being used, not the data in distant storage.

Distribute the data automatically, with the guarantee that all of it will converge on the same 'truth'.

Use a published open standard for encoding data with its meaning, and communicating changes to it.

Hi, I'm George. This year I left my day job as a software engineering leader, and plunged into lockdown under a mountain of work, uncertainty and risk. Last week, I pushed the button to launch the m-ld Developer Preview. In between has been a mad journey of creativity, anxiety, frustration, imposter syndrome, fight and flight, elation and time-dilation, and so! much! coffee!

But why?

As a data management app developer, I've used many ways to encode and store data. Frequently, they are combined in the same architecture, with one of the locations being blessed as the central 'truth':

centralised data

The specific technologies vary, but the overall pattern is very common. Motivations include properties of security, integrity, consistency, operational efficiency and cost. However, there are some other peculiar properties that stand out:

The 'truth' is on the far right-hand side; but the data is being used throughout, with particular value being realised on the left.
The software application is responsible for both distributing the data and for operating on it.
Every encoding syntax is specific to a technology, and does not expose the data's meaning enough to be independently understood.

The main consequence of these properties is application code complexity. We have to be incredibly careful to maintain an understanding, in the code, of how current (how close to the truth) our copy of the data is, operate on the data accordingly, and share the understanding with other components. This is hard, and frequently goes awry, resulting in software bugs which are very hard to reproduce, let alone fix.

In this blog, I'll argue that with recent advances in computer science we can make improvements to this, for many applications. Applying our manifesto, we want our architecture to look more like this:

live sharable data

But how?

One thing to notice in the centralised data pattern is that we're taking each encoding of the data and translating it into a new one, to make it suitable for computation, or storage, or to add security, or for whatever reason. At each translation the complexity of keeping the new encoding up-to-date with the previous ones ramps up.

What if we did away with the idea of re-encoding the current data, and instead transacted in changes? Humans do this naturally. When having a conversation about some information, we don't re-state it every time we want to adjust it. We refine information by discussing the delta between the old and the new. And we naturally switch between re-statement and deltas as required.

This concept is nothing new in software either: Event-Driven Architectures have been a common paradigm since at least the mid-2000s. But consumers of 'events' have a new problem: to apply the change to their encoding of the current data. This distributes logically duplicate program code to every consumer, and the number of bugs grows at least linearly with the lines of code. Even worse, the event ordering is critical, so the coordination of the totally ordered log of events becomes the new centralised 'truth' (and a literally bigger one).

Let's deal with the code duplication issue first. Being good engineers we take care not to repeat ourselves, but this becomes hard to do when re-stating something in different languages. So, what if we had a common language for data? One that could express both state and changes to state? Since we're here, let's have one in which we can encode the meaning of the data, per our manifesto, including a natural way to identify data universally. And further, can we have one for which native, widely-available, battle-hardened database engines exist, so sometimes we don't have to translate anything at all?

Sounds like a big ask. Luckily, academia and industry have been working on it for some time. But let's look at the other problem: change ordering.

Imagine if you shared some information with a friend, and then, every thought you had about it couldn't start until your friend finished whatever thought they were having about it. This is the strictest way that centralised data management systems maintain consistency.

To mitigate the impact of this on the fluency of data manipulation, there are various strategies available like fine-grained locking, optimistic locking and a choice of transaction isolation levels. These have various merits, but each of them re-introduces some of the very distributed application complexity we were trying to reverse, and they still require the central ordered log.

What if we went the other way, and just removed the ordered log entirely?

There are two approaches to concurrency control that don't need a total ordering of changes. One is called Conflict-free Replicated Data Types (CRDTs), and the other Operational Transformation (OT). These do provide the required guarantee that copies of the data will converge to the same 'truth'. But they don't remove the possibility that concurrent changes will disagree with each other and lead to a 'truth' that doesn't make sense.
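Both halves of that statement show up even in the simplest possible CRDT, a grow-only set, sketched here in Python. This is a toy chosen for brevity, with made-up data; it is not how m-ld or any particular OT system works.

```python
def merge(*change_sets):
    """Merge change sets by set union, which gives the same result in any order."""
    state = set()
    for changes in change_sets:
        state |= changes
    return state


# Two people concurrently record where the meeting is
alice = {("meeting", "room", "Blue")}
bob = {("meeting", "room", "Green")}

# Each clone receives the changes in a different order
clone_1 = merge(alice, bob)
clone_2 = merge(bob, alice)
rooms = sorted(o for s, p, o in clone_1 if (s, p) == ("meeting", "room"))

print(clone_1 == clone_2)  # True
print(rooms)  # ['Blue', 'Green']
```

Both clones agree without any ordered log, yet the meeting is now in two rooms at once.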

But wait, you and your friend had no trouble refining your shared information, with no deterministic coordination whatsoever. How?

Humans employ myriad strategies for coordination. You withhold thoughts while someone else is talking. You undo and redo thoughts against new information, both before and after expressing them. You notice conflicts that corrupt the information or render it illogical, apply obvious resolutions, and negotiate others. You actively seek consensus, or delegate decisions.

In the case of document editing, we can go further and notice that, given a foundational level of concurrency control in the software – Google Docs uses OT – editing by multiple humans works fine, and doesn't require much explicit coordination at all. Research groups have found that this applies just as well to CRDTs.

There are many finer details to explore in practice. But we have established that our manifesto can be met, in principle, with application of current computer science.

The approach that we've taken with m-ld is to provide a protocol, with implementing engines, for distributing data in a distributed application.

The 'truth' is the data exposed to the app by the engine.
The data is automatically distributed by the engine with the guarantee that all engines will converge on the same 'truth'.
We use an open standard for encoding data with its meaning, and communicating changes to it.

For now, we're proving out the tech, and filling out the corners that we think are essential for collaboration and autonomy use-cases. But we think we're onto something important to data architectures in general.

We'd love to hear what you think.

If you're ready to try m-ld out, you can work with the Developer Preview right now. Let us know what you're building!

George

gsvarovsky

20-Aug-20

Developer Preview - The first m-ld engineering milestone!

Thanks for visiting! We have reached our first engineering milestone, the developer preview. On this website you'll find information about m-ld, including why and how to use it. There's also a cool demo app which shows one use-case for m-ld.

And of course, the main event is the Javascript engine, which you can download and begin to experiment with.

We're excited about the many interesting use-cases for m-ld! We're working really hard on the ergonomics, the performance and the security. This is a pre-release preview, and we're eager for your feedback.

If you have a request or a question to share, go to the Discussion page of our GitHub main repo. If you're unsure where to start, or you'd just like to talk, you can contact us any time.

I do hope everyone is thriving; and we'll stay in touch!

George

gsvarovsky

16-Jun-20

m-ld: An update! - Progress since we last spoke

Hi! and thanks for being interested in m-ld: a decentralised technology to enable sharing of live data, in a fraction of the time, for a fraction of the cost, and more reliably. I hope you are well and thriving in this new normal.

I'm writing because a lot has happened lately!

On the business side, our materials keep on coming as we get our message out, to UK innovation organisations, investors and potential customers. You've already seen our strapline; and you can explore increasing detail with:

An elevator pitch: https://bit.ly/m-ld-pitch-video

An investment one-pager: on request
A full introduction doc: on request

In engineering, we're working hard on two main deliverables that will drop next month: the live demo app, and the developer preview. The demo app will give everyone a flavour of just one kind of software where m-ld fits perfectly, though it's not the only one of course! You get a quick feel for it in our demo video: https://bit.ly/m-ld-demo-app-video.

The developer preview will let you try out a m-ld engine in your own code. First will be the engine for Javascript platforms, and it'll be quickly followed by Java and Docker, because we're committed to having good platform coverage early on. All the engines will be open-source.

If you're interested in m-ld and would like to get more involved, brilliant! Here are some ideas...

Want to invest, or can you help with investor networking? Email invest@m-ld.io, we'd love to talk!
Thinking about a great use-case? Email info@m-ld.io, we'd love to hear about it!
Want to join in the developer preview? Email preview@m-ld.io, we'll sign you up!
Interested in joining us or contributing? Email careers@m-ld.io, we've got lots to do!