RethinkDB Fundamentals
-
Is RethinkDB Right for You?
There are so many good choices when it comes to document data stores, and even the question of whether you should use one at all. Let's go over the reasons RethinkDB may or may not be a good fit. I'll just cut to it right now. RethinkDB is the most impressive NoSQL solution that I have ever seen. I am so impressed with it, from its functional ReQL language to its ease of use and its speed. Really, it's hard to see a reason why it wouldn't work for you, but there are some considerations, so let's have a look. Specifically, what makes RethinkDB a good choice? You'll enjoy working with it if you like having flexible schemas. No more migrations. Just write your JSON to RethinkDB and it's saved. If you use an external system like Hadoop to run routine analytics, key word, routine, RethinkDB can take over for you. Using sharding and replication, you can keep your data in sync, but you can run your analytics on a completely separate server if you want. Finally, if managing your MongoDB or other document store system is a drag, well, RethinkDB can make your life a much better place to live. Now, conversely, RethinkDB might not be a good choice if you need to rely on a strictly enforced database schema, or if you have to have full ACID support for some reason. RethinkDB enforces safe writes out of the box, meaning you have to tell it that you're willing to forgo data safety in the name of speed. However, some people need more than that. If that's you, RethinkDB might not be a good fit. If you're running some deep analytics, you'll find RethinkDB's query language a bit confining. It can handle a lot, but it is not an analytical system. Finally, if you're running a system like Twitter or Papertrail that needs blazing write speeds, you might be better off using Cassandra or Riak. RethinkDB does allow you to tweak write speed, as you're about to see, but it really is not designed for that. And that's RethinkDB. It sits right in the middle of the high-end systems out there, giving you the best of both worlds without going too far in either direction. It's a lot of fun to use, and you can feel it the minute you install it and open up the administrative interface.
-
Installation
Let's get started by installing RethinkDB, and I'll do this on my Macintosh, and there's a couple of different ways to do it. So start off by heading over to rethinkdb.com. And here, you can see the install link right there at the top, I love that. And the version they have up there is version 1.16.2. Actually, the latest is 1.16.3, as I record this now, but anyway. If you click on install, it will take you to, guess what? The installation page. Here, you can see all your different choices for installing RethinkDB. There's a Ubuntu, OS X, CentOS, et cetera, and you know. RethinkDB does not run on Windows. However, you are a capable developer, and if you're running on Windows and you want to play along today, kick up an Ubuntu VM, and right here are instructions for installing on Ubuntu. And you can just follow these directly from their site. It's a real quick and easy installation. Setting up a VM is simple. Head over to Azure, or whatnot. You can also download and install the OS X installer if you're using a Mac, like I am. You can download the installation, as you see here on the top left, or you can just use Homebrew. Either one works really well. If you want it completely isolated, maybe using Homebrew might be better. But for today, I will be using RethinkDB 2.0. And you can see the release announcement post right here. Read it if you like, full of good information. However, if you go down to the bottom, you can see a link to their GitHub repo, where you can download the 2.0 release candidate one, which by the time you watch this, well, you're probably not going to need to do this. I believe 2.0 is going to be released by the time this video pops, but just in case it's not, here's instructions for downloading and using the OS X release candidate. All right, once you have it installed, let's check our version. You do this by typing rethinkdb -v into the command line, as I've done here. You can see, I'm running 2.0 release candidate one. Typing in --help, you can take a look at the commands that are available to you as well as a few other things. But notice, there's not all that many commands to working with RethinkDB. It is bleedingly simple. In fact, to actually run RethinkDB, you just don't have to enter anything at all. We'll come back to some of these commands later on. For now, let's start RethinkDB. And there you are. It starts right up. This is something to note straight away, that when you start RethinkDB, it will initialize a data directory right in the directory where you started it from. And it calls this rethinkdb_data. I kind of like this, because it keeps all your data files easily accessible and manageable. And the next thing to take note of here in the log splash is that you have three different ports that you're going to have to remember. 29015, 28015, and 8080. We'll talk about the first two later on, but 8080 is the one that I am most interested in, because that is the administrative interface. Taking a look at our directory here, where we launched RethinkDB, this is it. That's the data file right there. If you want to get rid of your database, just delete the data file. That might sound a little scary, but for development, it makes life easy. Let's head over to the administrative screen and go to localhost and then 8080, and take a look around, and here it is. That's the administrative interface. It's gorgeous, and it's simple. It's easy to use, and it's also aware. 
It gets push notifications from the server itself, so when anything changes, you're notified immediately. For instance, let's see what happens when we shut our server off. This'll give you a good idea of how well the team has thought through this admin interface. Look at that. You have been disconnected from the server, and it grays things out. It doesn't sit there trying to cycle and execute commands on a dead server. All right, for now, let's remove our RethinkDB data directory, and as you can see here, it's gone. The reason I'm doing this is because I want to reconfigure my server. I want to launch it with different options. And to see those options, type in --help. Specifically, I don't like the name that it gave my server. It just kind of came up with one; it was the name of my machine and then some random characters. I'd rather have a name that means something. So scrolling all the way to the top here, you can see the -n flag. That means server name, or --server-name, if you want to be more clear. -d is what I'd use if I wanted to specify the exact data directory; I'll just leave that alone. What I want to do is just start it up with a name, and I'm going to call my server Zayn. And there it is. It's all set up and ready and running. And it's worth pointing out that if you just turn off your server and try to start it again using -n, well, that would not rename your server. RethinkDB is a little bit more robust than that. However, if you wanted to rename it through the admin interface, you certainly can. We'll deal with renaming servers later on.
-
Altering Schemas
Let's play around with some schemas. We'll create a database or two and also some tables. Here, I have the administrative screen open, and I can just go to add a database here in the tables window. And I can also add a table. And just like that, I have a database named music, I have a table named artists, and I have another one, now, called invoices. Now, I don't need to accept the defaults. Specifically, RethinkDB will automatically create an ID for my documents called id. If I didn't want that, I could specify a different key here. Now take a look at the toggle at the very, very bottom here, the one that says, "Acknowledge writes only when written to disk." This is known as hard durability, and all tables in RethinkDB have hard durability by default; you have to switch it off. What this means is that when you insert some data, it'll take a little bit longer for RethinkDB to respond to you, so inserts can slow down. If you want to speed things up, you can turn hard durability off, in other words, setting it to soft. Now let's head over to the Data Explorer. This is where we get to play with the data that's in our database. I don't have any just yet, but maybe we can add some and goof around. The first thing to notice is we have code completion. This is one thing I absolutely love about RethinkDB. It makes learning ReQL, the query language, really easy. All these little popups tell you the things that you can do with the various entries that you have. Here, I'm just going to tell it to list all the tables in my database. And when I hit run, I can see a JSON view down below. It's just an array of strings, artists and invoices. I can switch this over to a table view if I want, and there we go. One thing I really appreciate about working with the Data Explorer is the way that it intelligently handles its code completion. Here, it knows I'm working with a table, and so it's going to pop up all the choices that I have for working with a table query. And you can see all the different choices that we will get to more later on. And you've probably been noticing that I've been typing r. a lot, and r is the root namespace for working with RethinkDB. There are a number of things on here. There are conversions, there are root-level objects, there are commands like do, there are things where you want to specify a literal. You can work with dates. But most often, you'll just be specifying what to do with a database. In this case, if you want to drop one, like I just did, you can do that. And look at this. This is one of the really neat things about working with RethinkDB. It tells you what just happened: DBs dropped, tables dropped, too. It's telling us the old value, that was the name of the database, music, and the new value is null, meaning it's not there. And if we head over to our tables window here and take a look, yep, it's not there. It's all gone. So you don't have to do everything through the admin screen. You can do it all right here in the Data Explorer. Let's create our database again, and I'll use r.dbCreate and then, there we go. You can see the changes down below. Dbs_created, one, and the next thing I can do here is actually create a table, too: tableCreate and artists. It's that simple. ReQL is pretty simple stuff, and there are just a few things you've got to learn. Okay, so I'm creating my tables here, artists and invoices, and I've got my new database, music. Let's head back over to the tables window, and there it is. The music database, and the artists and invoices tables.
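If you want to follow along in the Data Explorer, here's roughly the ReQL from this clip, gathered in one place; run the lines one at a time (the names are the ones used above).

```js
r.db('music').tableList()              // list the tables in the music database
r.dbDrop('music')                      // drop the whole database, tables and all
r.dbCreate('music')                    // create it again
r.db('music').tableCreate('artists')   // and the two tables
r.db('music').tableCreate('invoices')
```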
Now let's switch gears a little bit, and I want to show you one of the best things about ReQL and working with RethinkDB. This is the directory that I've been working in. As you can see, I just have my data directory for my database. Let's make a directory for a Node application, and I'll call it app. And, using npm, I'll initialize our application using npm init. And in here, I'll just kind of scoot on down through all of this stuff, and well, I got to just leave my name in here, as the author, why not? Yep, that looks pretty good. Okay, so now that we have a package.json, we can install the RethinkDB driver. And we can do this by just saying npm install rethinkdb --save. Now you might be wondering, Rob, why are you doing this? Why are we looking at Node? Aren't we supposed to be working with RethinkDB? And yes, of course, we are. However, when it comes to working with databases and booting them up, most of the time, you're probably going to be working with some kind of bootstrap file. That's what I'm creating here. I'm using the Atom text editor, and I wanted to show you how simple it is to use the RethinkDB driver to do some of the basic schema manipulations that you're going to need to do. Here, I'm just using the driver to connect to our local database, and notice the syntax. It almost directly mirrors what I wrote in the Data Explorer. Here, I'm just using r.dbCreate, our music database, and then I'm telling it to run, and I'm passing the live connection that was handed to us from the connect function above. This is the beginnings of our database bootstrap file for our project. And when you're working with RethinkDB, or any database, really, you're probably going to have a bootstrap file like this. So when team members come on the project, they can get up and running quickly. Speaking of, if we run this using the Atom's Runner, you can see, boom, we have a new database that was created for us against our local RethinkDB server. The next step, of course, is to be able to install the tables that we'll be using. For this, I want to be able to execute things in step in an asynchronous way. So I'll use an asynchronous manager for this. I could use eventing and a bunch of other things, but I find sometimes, just using the Async library is the easiest possible thing. So, and I'll reference Async above, and then, down below here, I'll create a few methods for working with our database. In fact, let's kind of orchestrate this a little bit better. Rather than calling r.connect in line, I think I'm going to try and make things a little bit better on the eyes. And so here, I have createDb, and I also have createTable. Let's move our connection inside of the createDb function, and then we'll just call next once that's finished. All right, well as exciting as this might be, I'm not going to make you watch me type this all out. So let's skip ahead a little bit, and you can see, a fairly typical bootstrap file. Creating the database, and then I have a createTable function, and then down below here, I am using async.map. That's going to call a function multiple times, depending on what values you have in the array. So in our array here, I have two string values, artists and invoices. That's going to call createTable multiple times and pass the name of our tables in. Finally, down below here, I'm using async one more time, telling it to create the database first, and then call createTables. 
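If you'd like to see the shape of that file before moving on, here's a minimal sketch of the bootstrap described above. It assumes the official rethinkdb and async packages and a local server on the default driver port; the actual file in the course download will differ in the details.

```js
// bootstrap.js -- a minimal sketch of the database bootstrap
var r = require('rethinkdb');
var async = require('async');

var connection = null;

var createDb = function (next) {
  r.connect({host: 'localhost', port: 28015}, function (err, conn) {
    if (err) { return next(err); }
    connection = conn;
    r.dbCreate('music').run(connection, next);
  });
};

var createTable = function (tableName, next) {
  r.db('music').tableCreate(tableName).run(connection, next);
};

var createTables = function (next) {
  // calls createTable once for each name in the array
  async.map(['artists', 'invoices'], createTable, next);
};

// create the database first, then the tables
async.series([createDb, createTables], function (err, results) {
  console.log(err || results);
  connection.close();
});
```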
All right, so, let's go back to the web interface and you can see, I have got no databases, no tables, nothing in our local server. They're all gone. So using the Atom Runner that I've installed, that's the Atom Runner package, and I run this, and boom! Everything has run pretty well, looks like. Database was created. Looks like the tables were created. I got an array of config changes back. Let's head it back over to the admin screen. And just like that, it's updated itself. I don't have to refresh. This thing is aware of the database, and if it sees any changes, like when you saw before, the server goes offline, or when a table is added or any other information that we need to know about, the admin screen will update itself.
-
Simple Import
Let's import some data into our database, and along the way, we'll see what we can do to improve our write speed. Notice here that our artists table has zero documents. And if we go over to our Data Explorer and we run this query, yep. No results returned. So we had absolutely no data. Let's head back over to our text editor, Atom, and in here you can see I have a new directory called data. And inside of it, I have two JSON files. And they're dumps from the Chinook database. And I just dumped them to disk here. In addition, I have renamed my index.js file to bootstrap.js, just so you know. So when you download the project, you can see all the files are in there. All right, so we are going to connect to our music database, and I am then going to just require the artists JSON file. And that's going to pull it in. That means I can just loop over every value that's in that JSON file and then insert it directly, one at a time. Simple pimple. This is how you do an insert. You just do insert and then the document, and then you have to tell the driver to run. That's really important, and a lot of people forget that. All right, we're running this, oh boy. That did not go very fast. I have exactly 204 records, those are all the artists in the database. And it took 2.246 seconds. Hm, that's a little bit slow. We'll have to improve that. All right, well let's head back over to the web interface and make sure we have some data. And I'll zoom in on this, so we can see the query better. And if I hit run, yep. We have lots of data, all the artists are in there. And notice that it only returns 40 rows at a time. It doesn't blow your browser up, which I really like. So if I type in .count, chain that method on there, you can see I have 204 records. And I can also delete them by adding .delete. They're all gone now. (laughs softly) So that is how you delete data in RethinkDB. All right, the reason I deleted it is because I want to enter the data again, but I want to do it smarter. So, to speed things up, I have a couple of options. The first is to turn hard durability off, and I can do this right here in the insert call. I don't need to go and change the configuration of the table. So once again, to remind you, soft durability means that RethinkDB will acknowledge that the write is in, but it does not wait for it to be persisted to disk. Hard durability on the other hand, means that the acknowledgement only happens when the data gets written to disk. Okay, so just by flipping the durability to soft, let's run this one more time with the Atom Runner. And look at that. That was a dramatic speed improvement. In fact, I believe it took a half a second. Yep, .466 seconds, that is a much better time for writing. However, it should be noted that if you're trying to speed things up, setting durability to soft might not always be your best choice. Let's take a look at other ways that we can speed things up. So first thing I want to do is let's flip back over to the web interface, and I want to make sure that I delete all the data. And, boom, it's all gone. Okay, let's flip back over into the code and see what else we can do here to speed things up. I can set noreply to true. By default, every time you do any operations on data in RethinkDB, it'll send back some form of acknowledgment, some kind of JSON record. We've already seen this when we've dropped and created databases and tables. Here, I'm just saying, I don't care if you want to acknowledge that the insert happened. Just don't tell me, I believe you. 
This is not the same as soft durability. If we run this, you can see it is very fast, but there's a slight delay. It still takes about 2.5 seconds in total, but our code wasn't prevented from moving on. As you saw, the 204 records are splashed right out to the screen, with a notification down below that took a little bit longer. That's because before we're able to close the connection, all the writes had to be acknowledged, although the driver didn't tell us about them. So it only looks to us like it sped up. All right, well what happens if we set durability to soft and noreply to true? Let's try it out, and boom. Everything goes really fast, but it's the same effect as having durability set to soft. It's about a half second for all of the writes to go in, so there's really no gain there. What we really should be doing here is not looping over the records and executing a run each time. Even though we have a live connection, executing individual writes like this is very slow. What we really want to do is insert the entire array at once. The driver is smart enough to know, oh, these are individual artist records. Next, I'll remove the durability and noreply settings, and then down below, instead of console.logging our counter, which would be one, I'll just log the length of the array, so we can see how many records were inserted. And running it, that is pretty quick. That, again, is hard durability, and we're down to about a half-second now, just by inserting an array instead of the individual artists. All right, now let's see what happens if I set durability to soft on that array. What do you think's going to happen? Eh, you know what? Not that much of a win. We're already close, which is great. What this means is that RethinkDB is really fast at writing, especially if you insert an array, where you can keep hard durability without paying much of a penalty. Here, it went in almost instantaneously, and our program can go on and process other things.
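For reference, here's a sketch of where the import script ends up: a single bulk insert of the whole array. The data/artists.json path is an assumption based on the project layout described above, and the durability and noreply options are left as comments so you can experiment with them yourself.

```js
// insert_artists.js -- sketch of the bulk insert discussed above
var r = require('rethinkdb');
var artists = require('./data/artists.json');  // the Chinook artist dump, an array of documents

r.connect({host: 'localhost', port: 28015, db: 'music'}, function (err, conn) {
  if (err) { throw err; }
  var start = Date.now();

  // One insert call with the entire array -- much faster than one run() per document.
  // To trade safety for speed, you could pass {durability: 'soft'} as an insert option,
  // or {noreply: true} as a run option, but with a bulk insert neither is really needed.
  r.table('artists').insert(artists).run(conn, function (err, result) {
    if (err) { throw err; }
    console.log('inserted', artists.length, 'artists in', (Date.now() - start) / 1000, 'seconds');
    conn.close();
  });
});
```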
-
The Basics
Simple CRUD
Let's play around with some data: create, read, update, and delete. And I'll be playing with the artists table here. Let's click on it and take a look at what we have. Now, it's kind of weird, because here it says about 100, as if it's not sure. (laughs) Which I don't quite understand. But here, if we say r.db and then music, and specify the table, which is artists, we can then ask for a count. We can do that by just chaining on .count. I'll zoom in so we can see a little bit better. We run this, and we can see exactly how many documents we have, 204. What you see here are the basic mechanics of ReQL. You chain one operation onto the next. Here, I said I wanted to query all records of the artists table, and then I want you to count them for me. Next I could say I want to know what the type of that result is, and you can see down below there, it's a number. That's ReQL: you chain commands together, and it allows you to do certain things like inserting data. This is one of the fun things about working with a schemaless database. I can insert anything I want. I guess that's good, and it's also bad, as you're about to see. Okay, have a look down below here, and this is the response that we get from RethinkDB. It tells us that one record was inserted, and also that it generated a key. That key is a GUID, or universally unique identifier. If we had supplied an id value, it would have used that instead. Okay, now what I can do is run filter: if I just replace insert with filter and pass in my name, there's the document. If I want to delete it, I can filter, and then delete. Notice that all I had to do was chain delete onto the end of the filter command. Now I can insert another document. Let's say my title here is Pluralsight Author. After all, I deleted myself, and that wasn't very fun. And then boom, we have the same result that comes back to us: one inserted, and one brand new key. We can take that key, and we can use it yet again in a different function. I can say go and get the document with this ID. Now that I have that document, I can do things to it, like replace the entire contents. Let's say I don't want to have my title there, I want to have my Twitter handle. I can do that by using replace. Whoops, or not. Have a look down below: Inserted object must have primary key id. That means I called replace, but the document I passed in didn't include the primary key. Even though we used get above, replace won't fill the id in for us; it swaps out the entire document. In order to do this properly, you have to make sure that the entirety of the document goes in. In other words, everything will be swapped out, so I have to specify the id here, and the name, and my Twitter handle. Running this, there we go, you can see that one document was replaced. That's what we wanted to see.
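Here are those CRUD calls again, roughly as typed into the Data Explorer; the id and the Twitter handle are just placeholders for whatever RethinkDB generated and whatever you want to store.

```js
r.db('music').table('artists').count()                                 // 204
r.db('music').table('artists').insert({name: 'Rob', title: 'Pluralsight Author'})
r.db('music').table('artists').filter({name: 'Rob'})                   // read it back
r.db('music').table('artists').filter({name: 'Rob'}).delete()          // chain delete onto the filter

// replace swaps out the entire document, so it must include the primary key
r.db('music').table('artists')
  .get('GENERATED-ID-GOES-HERE')
  .replace({id: 'GENERATED-ID-GOES-HERE', name: 'Rob', twitter: '@myhandle'})
```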
-
Chains and Filters
ReQL is a functional language; it was modeled after Haskell and Lisp, believe it or not. Sometimes, though, when I'm working with it, I feel like I'm working with JavaScript, although JavaScript is object-oriented. I don't even want to go into those discussions. Anyway, here is the table of all of the artists that we have. Again, 204 records. And I want to be able to play with some of this data. Specifically, I want to be able to query for partial values. So I can do that using the keyword filter, and filter can take a number of things, including a callback function, as we're specifying here. Although in functional programming this is more of a lambda, or anonymous function, if you will. So what I need to return out of this is the filter that I want to see. In this case, I want the artist's name to match on something. And this is one of the fun things about working with RethinkDB: it supports regular expressions straight away. So in here I could say I want anything that matches House, and then I can pass in a .*, although that's not really needed; this is a partial match by default. And then you can see House Of Pain. If I put in Ho, I can see Red Hot Chili Peppers and House Of Pain. Notice again, the .* or not, either way it's a partial match. Alright, well, filtering on top-level keys is not necessarily that exciting, although I do like the match functionality a lot. Typically when you're working with a document, you want to get at information just about anywhere, especially if those documents contain embedded arrays, like ours do. As you can see here, every artist has one or more albums, and every album has one or more tracks. So this is a double-nested array; how in the world are we going to filter on that? Let's take a look. The first thing I want to do is take a look at all the albums for each artist, and for that I can use concatMap. And this is one of the downfalls of working with something like ReQL: you have to discover all of these little functions that are buried inside of it. But the good news is there are not that many. So for the concatMap function, I need to tell it what I want to see. I'll explain how it works in a second. I'm telling it that I want to see all the albums for the artist. Here you can see that, boom, it's given me all the albums, as if each album were at the top level of our document structure. In other words, these are not arrays, these are just objects, and what I get back is an array of all of the albums. That is how concatMap works. So say I have a document with an items array, and I want to get at those items directly, I want to see them in a list. The first thing I would need to do is map those items, and that will return an array. Then I would need to concatenate the items out of each array and drop them into a single result set. So let's go over this step by step. Again, step one, I would need to map the values. And if you've ever used map in Ruby or JavaScript, you know that it returns an array. The second step is to take those items, strip them out of the array, and concatenate them together. Well, the good news is concatMap does both all at once. (laughs) So it is one of the most useful functions that you will find in ReQL. Here, I'm just telling it I want to see all the albums. And this is where things get fun. I can chain on another concatMap, because now I'm operating on individual albums. And here, I can say, I want to see all the tracks.
And look at this, I could run this and now I have a list of all the tracks, as if they're individual documents. This is the functional nature of ReQL: chaining commands together, composing and shaping the data as you want in a functional way. And to me, this is amazing. I have never seen anything like this in any other database system. It's usually some wonky query language that's not easy to use. As you can see right here, things can be quite easy. Let's take a look at another example. You can chain together filters, and that's what I'm doing here. I want to be able to filter all artist documents based on the genres that are in the tracks. So to do this, all I need to do is chain together filter operations until I get down to the track level. Here I'm three levels deep, and then I just need to return whether the genre_id is one. So running this, and having a look, yep, genre_id one. That was not that difficult. Alright, well, I should note that I don't need to use a callback up here. I can set the filter directly on the tracks, and I can do so using an object. I can just say I want the genre_id to be equal to one, just like that. And if I run it, I get the exact same results down below.
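Here's the gist of those chained queries in one place: the regex filter on the artist name, the double concatMap down to tracks, and the genre filter in both the callback form and the shorthand object form. The field names (name, albums, tracks, genre_id) are the ones from this dataset.

```js
// partial match on a top-level key
r.db('music').table('artists').filter(function (artist) {
  return artist('name').match('Ho');   // Red Hot Chili Peppers, House Of Pain...
})

// flatten the nested arrays and keep only the tracks in genre 1
r.db('music').table('artists')
  .concatMap(function (artist) { return artist('albums'); })
  .concatMap(function (album)  { return album('tracks');  })
  .filter(function (track) { return track('genre_id').eq(1); })
  // ...or, instead of the callback, the shorthand object form: .filter({genre_id: 1})
```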
-
Common Functions
As I mentioned before, getting to know ReQL is all about getting to know its functions. So let's take a look at some of the common ones that you're going to use often. The first one I want to show you is pluck. And if you're a Ruby developer, you know what pluck is. This will take selected keys from a document and return them in an array. Here I just have albums and name taken from our artists. It's a bit like map, but it focuses solely on returning selected keys. The next one I want to show you is limit, and if you've used databases before, then you know what limit does. It restricts the results returned to a number, so if we only want five results back, well, you get the idea. Just say limit(5). And if we want to order by anything, we just specify it in an orderBy, and you can orderBy more than one key if you like. So here I'm saying orderBy the name, and then limit the result set to five. What you're seeing is functional composition at work. You just chain the functions in here, in a fluent-interface sort of way, and you can return what you like. Now take a look at this, the order of the functions is important. If I say skip, basically skip some records, after limit, well then no results would be returned. Because I'll only have five results that come back, and I'm telling RethinkDB to skip 10 of those, so I get nothing. However, if I put skip first, I would say skip 10 from the order, and then give me the next five. So here I've skipped 10 records, and this is the five records after that. Okay, let's do something a little bit different here. I've already shown you one of my favorite functions, but it is so useful, and I've used it so much, I want to show you again. And that function is concatMap, and what this does, to remind you, is pull out selected fields in an array and concatenate them together into their own array. So here I'm going to use the artist document, and I want to return, and I have to do this using a callback, all of the albums. So this is going to return all the albums as if they were documents in their own list. I can now filter that list if I want to, as I would any other table. So I could say filter, and let's do a callback here, and I could just apply a normal object filter if I wanted, in other words I could specify title equal to something. Here I want to actually do something more, I want to use match. So match is going to take a regular expression, as you saw me use before, and here I'm just going to look at the album title, with a match of Jagged. But I run into a problem, look at this. Remember before I said you could insert any kind of document in there? And well, wasn't that neat? Well sometimes it's not, and in this case it's not. RethinkDB can't apply the filter to this document I inserted, because it has no attribute albums. And it might seem like a weird limitation. However, you can get around that quite easily, you just have to specify that you only want documents that have the field albums, by using hasFields. And of course I have to put this in before any of the filters. And there it is, my match worked. Generally speaking, if you have a table, then the documents in that table should be, well, roughly the same. But that's not always going to be the case, so hasFields comes in very handy. The next thing I want to show you here is that filter comparisons are case sensitive. So here I'm just filtering on my name, typed in lowercase.
If I put it in proper case, then you can see the document that I inserted in the previous clip. Let's use pluck so I can grab the Twitter address from my document, and there it is. Well now let's use pluck on the whole table and see what happens. First thing to notice is that there was no error, which is a bit odd, because this is a table full of artists, and only one document has a Twitter address on it, and that would be mine. Filter threw an error before, when we tried to filter on albums, and my document didn't have albums, so it said, "I can't do that," and it threw. This might seem counterintuitive, as a matter of fact I will go ahead and say that it is, so you're just going to have to memorize this. If you try to use filter on an attribute that's not there, then you'll get an error. If you use pluck, then you'll get an empty set, as you see. Either way, the answer is fairly straightforward: try to remember to use hasFields if you're not sure the attribute is on your document. In fact, using hasFields together with something like pluck is so common that RethinkDB actually has a built-in function that does both. And this is a common kind of thing you see with RethinkDB. You see functions that actually do two things, like concatMap. For this operation, use withFields. This is pluck and hasFields combined. Believe it or not, that's pretty much it. RethinkDB comes with a number of functions, but I don't find them overwhelming. I use pluck, filter, concatMap, map, hasFields, withFields, limit, orderBy, and skip the most often. When doing analytics, however, of course there's more you can use. But for the average queries, these are the common functions that you're probably going to be using most often.
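A quick reference for the functions from this clip, roughly as you'd type them in the Data Explorer; note the order of skip and limit, and the way hasFields and withFields guard against documents that are missing a key.

```js
r.db('music').table('artists').pluck('name', 'albums')            // just the selected keys
r.db('music').table('artists').orderBy('name').limit(5)           // first five, ordered by name
r.db('music').table('artists').orderBy('name').skip(10).limit(5)  // skip ten, then take five
// (orderBy('name').limit(5).skip(10) would return nothing -- you can't skip 10 of 5 rows)

// guard against documents that don't have an albums key at all, then match on title
r.db('music').table('artists')
  .hasFields('albums')
  .concatMap(function (artist) { return artist('albums'); })
  .filter(function (album) { return album('title').match('Jagged'); })

// withFields is hasFields plus pluck in one call
r.db('music').table('artists').withFields('name', 'twitter')
```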
-
Dates and Times
Working with dates and times in any platform is always a challenge, whether it's a database, or a web framework, whatever. Getting it right can be really, really, really difficult. So let's see how RethinkDB does it, and let's take a look at our invoices table here, and this is an invoice, as you'd expect, this comes from the Chinook database. And notice that I have an invoice date. And it has a very certain format. This is the way the RethinkDB driver handled the insert of the invoices record. So it takes a date from Javascript and turns it into what you saw there. So let's pluck that invoice date out, and you can see the format once again. It even comes with a timezone at the very end, you'll notice it's all zeroes, that's because I'm in Europe currently. So I can filter this table, and I can filter on that date. So for this, I'll use a callback function, and I'll send in the individual invoice. And here I'm going to return the invoice date, and then I can use a comparison function, .gt, just dot, greater than, and I want to check for all invoices greater than January 1st, 2012. And it just works, which surprised me. Believe it or not. I was expecting to do all kinds of gymnastics to get the dates to work right, because working with Javascript and dates, well, it's just not terribly fun. But RethinkDB handles it elegantly. Okay, let's do a little bit more here. Let's take a look at all the invoices that we just queried. And that's 162 invoices generated since January 1st, 2012. How can we look at the invoices only for the year 2012? Let's try and do that by chaining on less than, or .lt. And here I'll just say the date should be less than 2012-12-31. And 412 (laughs) that's a little bit counterintuitive, isn't it? Well let's take a look at what exactly we got back here. There we go. So we have a bunch of invoices, we actually have two subsets, we have a set of invoices with the invoice date greater than January 1st, 2012, and a subset that's less than 12/31, 2012, so it gave us back both, that's not very good. So let's use .and and see what it gives us. Now, same thing, a lot of invoices, and that's not exactly what we wanted. I would've thought that it would've applied, well, let's say a between condition here, but it doesn't, it actually brings us two different subsets of invoices, and try as I might to come up with an explanation for this, I can't. It's just not the way the API works. But the good news is there is a better way. If all we want is the invoices from the year 2012, well, we can specify year. Sort of, and I'll show you what I mean in a second. I'll use double equals here, and see what happens. And I think I'll remove pluck and count, because I want to see all the invoices that are returned, just in case I have to do some sleuthing, and (laughs) no results at all. That's weird. I've specified that I want the year to equal 2012. Hmm, let's dive into this a little bit. Here it's just trying to do a straight coercion comparison. If you've ever used Javascript before, you know that this causes problems. I'm trusting that .year is returning an integer value of 2012, but well, can't really be too sure about that, can we? So let's take away the double equals, and instead I'm going to use .eq. And you might be thinking, "Why don't you just "use triple equals here?" That's a strict comparison, that's what you're supposed to do in Javascript, and you are right, but it won't solve my problem, because my problem is actually a little bit deeper than this. 
So for this I'm going to use .eq, and this is an equality comparison function that RethinkDB gives you, and I would highly suggest using it always, because it's going to handle all kinds of cases. So let's see what happens when I run this now. I get an exception, which is surprising: Not a TIME pseudotype. The thing that's actually throwing that exception is the .year function, as it's trying to coerce a year out of something that is not a time pseudotype; I'll talk more about that in a second. The next thing that appears to be happening is that I'm comparing that failed expression to the number 2012. I don't know if that's exactly the case. Either way, what I know is that the exception was not thrown until I used .eq. So I did some sleuthing offline here, and I found that if I just did a map and returned the invoice date's .year, just to see what it is, well, I get the exception there as well. I don't know why it wasn't throwing when I was trying to do the comparison before; I can only guess that it's just a little quirk in the release candidate. However, it is just a strangeness that you should know about if you're trying to follow along. So if you use .eq, everything should be okay. But how do we solve this weird issue with this TIME pseudotype, and what is it, after all? Well, the short answer is that when RethinkDB was first released, it didn't have support for dates, and that was a long time ago. Since then, they have added support for a number of different data types, and date and time is one of them, and the way they do this is they create what's called a pseudotype. It's a bit of a manufactured way of handling specific kinds of data, and you'll see this with geometry data and other kinds of data inside of RethinkDB. So for dates and times, they want it in a certain format, as you see here, with the timezone attached, because that's very important when you're trying to do time calculations. While all of that is interesting and neat, why am I getting this error? After all, I stored this as a date, and you can see the formatting there is specific to RethinkDB, so it should know that this is a date and time. However, unfortunately, our invoice dates are actually stored as strings. You have to coerce them in your query to a time. I find this a little strange, specifically because the geometry types are actually stored in a very specific way, and we're going to see that later on when we do geospatial querying. But if I was to come in here and write a query and say, "Hey, tell me what the type of this invoice date is," I can do that using .typeOf. It would tell me it is a string, as you see here, which is a little bit frustrating, because you have to write conversions in line as you go. That's just the way it is; I wish I could give you a better answer. But what we need to do here is run our invoice date string through RethinkDB's ISO 8601 conversion to turn it into a time. And if we do that, we can then ask for the year. And by doing that, we can finally get the right invoice dates. So here we could just say we want all the invoices with a date in the year 2012, and we can see the right answer, which is 82. Now, there's more we can do with time in RethinkDB. You can use r.time to construct dates that you want to work against later on in a given query. Here you can use the time function as I'm showing, and it takes a year, month, and day, then optionally an hour, minute, and second, and finally a timezone. And you can use shortcuts and syntactic sugar; you can use r.january, for instance, if you don't want to type a one.
Anyway, here I give it four arguments, a year, month, day, and timezone, and it gives me a time, and I can also just put in a one instead of r.january, and it's the same thing. So what did we learn about using dates with RethinkDB? Well, I learned a number of things as I've been using it over the years. The first is: always use .eq. You never know what's going to happen with that double or triple equals, so it's a good habit. The next is, if you need to query against any specific kind of date measurement, like year, month, day, quarter, whatever, just store it as its own attribute in the document. It's easy to do and, as I mentioned before, space is not necessarily a concern with document systems like RethinkDB. If you do need to do querying with dates stored as strings, make sure you use r.ISO8601; it's a built-in function that parses the string and gives you a time pseudotype that you can then query as you normally would. So all in all, querying with dates, yeah, a little bit rough around the edges, but it still can be workable.
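Putting the pieces together, here's a sketch of the date query that finally worked, plus r.time. It assumes the invoice date lives in an attribute called invoice_date and is stored as an ISO 8601 string with a timezone attached, which is how it was described above; adjust the attribute name to match your data.

```js
// filter invoices to the year 2012 by parsing the stored string into a time pseudotype
r.db('music').table('invoices').filter(function (invoice) {
  return r.ISO8601(invoice('invoice_date')).year().eq(2012);
}).count()   // 82 in this dataset

// building times by hand: year, month, day, optional h/m/s, and a timezone
r.time(2012, 1, 1, 'Z')           // January 1st, 2012, UTC
r.time(2012, r.january, 1, 'Z')   // the same thing, using the month constant
r.now().typeOf()                  // "PTYPE<TIME>"
```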
-
Types
In the last clip with dates and times, you saw me use the typeOf function to find out what kind of type I was working with. Let's continue on and find out some of the base types in RethinkDB, and I'll be interspersing some of the documentation from RethinkDB's website. And the first thing we'll start with is tables. Pretty obvious here, they store data and you can write to them. So we already know this, because we specify .table and that is, well, a table (laughs) no surprise there. But what about something like concatMap? You've seen me use concatMap before. What does it return? Well let's do what we've been doing before, and we'll return the albums. And you can see it's a type of stream. What's a stream? Well that's what you're going to be working with quite often when you run a query or flatten arrays with concatMap. It just returns a cursor, it doesn't return the entire result set, and this is really helpful if you're bringing back a big result set that you want to loop over. And you can loop over it right in your client application in the same way you would do it in the Data Explorer. So those are streams, and it's really important to know when you're working with them. But there is a slight difference if you're using filter. Let's take a look, and for this I'll just return the album_id that is equal to two. And what I'm most interested in is, there we go, the type that is returned, which is a selection of type stream. That's a generic, if you've never seen one. So what is that? Well, a selection represents a subset of a table. So that's easy to understand. And so if we use filter, we get that subset, and it's returned to us as a stream. But there are two kinds of selections: there's an object selection, which is just one record, and there's a stream selection, which returns a bunch. And so in our case, we got a stream selection. Why didn't we get an object selection when we asked for one record using album_id of two? Well, because we didn't use get, we used filter. And filter returns a selection of type stream. So let's do something a little bit easier, what if we did count, what does count return? Well that's good news, it's just a basic number. However, this is the part where type enthusiasts out there are going to say, "How are the numbers stored?" And they're stored as double precision, or 64-bit, floating point numbers internally. This means that if you're doing anything with numbers that requires deep precision, decimal places and so on, well, floating point arithmetic is going to be a bit of an issue for you, so just be aware of that. Alright, well let's take a look at the ID here, and as I mentioned before, all the IDs generated by RethinkDB are stored as GUIDs, or UUIDs. If we use a get and we use that ID, what do we have? Well as I mentioned before, we have a selection of an object. This is not using filter, so it's not a selection of a stream; it's just a selection of an object. If I use pluck on here, you might expect this to be a string, after all it's just the name, but it's not, because pluck returns the key name along with its value, so it's actually an object. Which brings to mind that the core data types in RethinkDB are the basic JavaScript types, with the exception of Date. There are objects, there are arrays, there are numbers, and there are strings. And everything else gets coerced into one of those. You can then spin up from there using the pseudotypes that are in RethinkDB; we'll take a look at those more in-depth later on.
But what about the document's ID? I mentioned before that it was generated, and here you can see that ID, it's a big fat GUID, and it's stored as a string. I guess no surprise there. If I ask for a string field, in this case status, and I ask for the type, it's a string, so again no surprise there. Strings, numbers, arrays, objects, and let's take a look at one last example here just to reiterate this. If I say r.now, which returns the current date and time, you can see down below it's a pseudotype, or a PTYPE, a pseudotype of time. And that's what we tried to work with in the last clip, and it was a little bit confusing, but the good news is that there are easy ways to work with it, and if you read the documentation on time, well, then you know that the drivers, the JavaScript, Ruby, and Python drivers, will automatically convert those dates and times for you, for your client to consume, even though they might not be stored with those specific types in the database.
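Here are those typeOf checks gathered in one place; the labels in the comments are the kinds of answers described above, and the id is a placeholder for one of the generated keys.

```js
r.db('music').table('artists').typeOf()                            // "TABLE"
r.db('music').table('artists')
  .concatMap(function (a) { return a('albums'); }).typeOf()        // "STREAM"
r.db('music').table('artists').filter({name: 'Rob'}).typeOf()      // "SELECTION<STREAM>"
r.db('music').table('artists').count().typeOf()                    // "NUMBER"
r.db('music').table('artists').get('SOME-ID').typeOf()             // "SELECTION<OBJECT>"
r.db('music').table('artists').get('SOME-ID').pluck('name').typeOf()  // "OBJECT"
r.now().typeOf()                                                   // "PTYPE<TIME>"
```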
-
Shaping Data
When working with a document database, sometimes the challenge is getting the data back from a query in the shape you want to see it. Let's take a look at how you'd do that with RethinkDB, and I'll be using the invoices table. And on each invoice, I have a customer. What if we wanted a list of customers? Well, the neat thing we can do here is we can just add on customer in parentheses, and what we get back is, as you would expect, a list of customers. It is just that easy. You might be thinking, "Wait a minute, why didn't you just do this instead of concatMap in one of those earlier examples? That seemed a little more complicated." Well, the answer is that concatMap works on embedded arrays, and this is just an object called customer on our main document. Next up I can use merge. Merge is a very useful function that you'll use with RethinkDB, and what it will do is merge keys onto your result set. So here I can say that I want an invoices key added to the customers that are returned. And if I run this query, you can see I have a new key down below, and it's an array that just says one, two, and three. Simple enough. Alright, well let's continue on here. What I'd really like to do is to get the actual invoice numbers in there instead of just these fake ones, and I can do that with another query. And here I have to make sure that I specify the DB that I'm querying against, because I'm using the Data Explorer. I'm going to query the invoices table, and I'm going to filter it; can you guess on what? You've seen me do enough of these queries now that this should start looking familiar to you, as you see the ReQL kind of flow out. I want to return all invoices with a customer_id equal to what? Well, I have a customer object that I can use, and I can specify the customer_id. What I am doing here is effectively flipping around and inverting our document, which is something that is not terribly difficult to do with ReQL. We'll try to run this, whoops, that's kind of a wonky error. Expected type DATUM but found SELECTION. Well, the good news is, we know what a selection is. Selections come from filter. So we can look at that selection and say, "Hmm, it's expecting a datum, what in the world does that mean?" Well, reasoning it through, merge, it appears, needs an actual chunk of data, or datum, and what we're giving it instead is the selection coming from our filter function. What we can do to get around this is to turn the result of that filter into an array, and we do that using coerceTo. It seems a little hacky, but hey, it works. Look at this. I should say right now there are better ways of doing this, but I did want to show that it is just this easy to invert your document using ReQL, in a query that amounts to eight lines. Actually, seven if I take that space out. Alright, well the last thing I need to do here is this: I don't want the entire invoice brought back for the customer, because that would be weird and recursive. So I'll use pluck, and I'll just grab the invoice_id, and now we have an array of invoices with an invoice_id only. And there you go. As you'll see in the next modules, actually shaping your data so you can operate on it is quite important. Here, I inverted our invoice documents to be customer documents with invoices appended to them. And it really was not that difficult.
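Here's the whole inverted-document query as a sketch, using the customer_id and invoice_id attributes from this dataset; the coerceTo('array') is the piece that keeps merge happy.

```js
// turn invoices inside out: a list of customers, each with their invoice ids merged on
r.db('music').table('invoices')('customer').merge(function (customer) {
  return {
    invoices: r.db('music').table('invoices')
      .filter(function (invoice) {
        return invoice('customer')('customer_id').eq(customer('customer_id'));
      })
      .pluck('invoice_id')
      .coerceTo('array')   // merge wants a datum, not a selection
  };
})
```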
-
Advanced Topics
The Import Tool
A major challenge that I personally have when showing people how databases work is trying to keep it real. And so far, we've been covering the basics with a pretty basic dataset, but that is about to change. What I want to do is I want to generate a bunch of data, and there's a number of different ways that I can do this, I can find sample databases, or I can just make it myself. And for that, I am going to turn to faker.js. You can use this in the browser, or you can use it with node. I'll be using it with node, and the idea here is that you can just tell faker what you want to create, and it will go out and do it for you. You can create fake names, you can create fake emails, you can change the locale to generate fake data, it's amazing. So, here we are. Names, I can do addresses, phone numbers, avatars, email, usernames, domain names, passwords, and I can do company information, images, I did images of cats, I can't believe I generated all this stuff up and I looked at these images of cats. I'm like, you've got to be kidding me. So as you go through here, you can see, there is a ton of stuff that you can generate, and that's what I want to do today. In fact, if you want to test your UTF-8, you can even use Chinese characters. This thing is really complete. In addition, I have taken another data dump from the Chinook database, but I've kind of moved things around a little bit. So before I had an artist table, but I want to sell individual albums as products, so I did some post handiwork and tried to make it look as much of a product as I could. I added a SKU and some pricing information, and I embedded vendor and details in as individual objects, and also an embedded array of tracks. So in preparation, I have cleared out our music database of everything, and I do have a JSON data dump of albums, and I'll show you that in just a second, but what I wanted to show you right now was the RethinkDB importer tool. So flipping back over to our console, if I type rethinkdb import, and then dash, dash, help, it'll tell you all about the things you can do, and all the things that you're going to need to have. So we can use JSON, we can use XML, we can use CSV. I have a JSON dump, and I'm going to be importing with that. Either way, no matter what you do, you got to have the python driver installed. The importer works in python, so that's just what you got to do. Okay, well, let's import our file, and I will specify its location using -f, and it's in the data directory of our little node application here, and I've called it products.json. And all I have to do is specify the table, and here I'll just say music.catalog. Now, you have to make sure that you do music dot, if you don't, you'll get an error. And that's that, that was really fast. I have to say, it was a little surprising that we'd go that fast. All right, so let's flip back over to the admin website, and there it is, our site has updated itself, it saw that a new table went in, and in there, we have about 300 documents. If we go to the data explorer, if I type in r.db and then music, just to be sure that we have the data available to us, catalog as the table, and I'll zoom in so we can see this better, and if I run this, yeah, look at that. We have all of our brand new albums in there. For those about to rock, oh my gosh, I remember when that song came out, oh jeez. That's a lot of fun, AC/DC, rock on.
-
A Better Dataset
All right, let's kick this into high gear. I have 350 product records, but I want a lot more than that to really test out RethinkDB. So, in our data directory, I have a number of documents here that I've used as real world examples. I have Stripe checkouts, I have commerce checkouts from my work at Tekpub. And I'm trying to make this as real as I possibly can. I want to use this stuff with faker, but I also want it to be based on real information. So here's a Stripe response that I got for a successful credit card checkout, and here's how I've changed the generator. I've taken checkout information as a template from Tekpub's sales engine. I've also taken fulfillment stuff. And all the data, of course, has been anonymized; there's no real data in here. I'm using faker to generate the fake stuff, but I am using the same document structure, to keep this as real as I possibly can. I am generating fake addresses in here, and I am calculating the amounts as accurately as I possibly can. I am doing my best, is what I'm trying to say, to make this as real world as I possibly can. What I'm trying to do is emulate a really successful business. So let's flip over to our generator, you can inspect this code on your own in the download if you like, and if I run the generator, we are going to generate (imitates fanfare) 100,000 sales, because I think our business has been really successful, isn't that neat? So it's generating the sales and writing them to disk; I'm not actually going to jam them into the database. I want the import tool to handle this for me. So that is kind of thing number one. Let the importer do the heavy lifting. It has been written specifically so it can do this efficiently. If you try to do it yourself, well, let's just say, I know this from experience, you're going to crash your system. Anyway, I'm outputting those documents. You can run the generator yourself too, just output the documents, and then you can run the importer, as I am doing here. I want to see how well RethinkDB handles pulling in 100,000 records. So, I am pulling in the sales.json file from the drop directory, and there it is. It has created a sales table for me, and what we can do is actually sit here and watch the import go off. And you can see the graph moving as the writes start to happen, and we've pulled in about 7,000. There's 18,000, and it's kind of fun to watch this, but notice, it doesn't go incredibly fast. Our writes are right around, oh, 1,100, 1,200, 1,300 documents per second. But this can be influenced by any number of factors. As you're going to hear me say later on, the write speed depends on the size of the document, of course, any indexing that you have on the table, whether you're using hard or soft durability, whether you wait for replies back, and so on. In our case, we are using the importer, so the importer has kind of managed all that stuff for us. I believe it does 200-document batches, and that's why it goes as fast as it does. Okay, skipping ahead a little bit, I'm not going to make you wait for the whole thing, but all in all, it took about a minute and a half, 94 seconds, to load 100,000 documents. That's not so bad. I don't have any indexing on this table, I'll do that in the next few clips, but if we want to count our records to be sure that we have all the data we expect, running a count, there it is, 100,000. Now if you want to amp this up and go for a million or two million, go for it. It's up to you. You can take a look at the code, and you can tweak it as you like.
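If you're curious what the generator boils down to, here's a stripped-down sketch: faker builds the fake customer and address data, the totals are calculated from the items, and everything is written to disk for the import tool to pick up. The field names here are illustrative; the real generator in the download has many more fields (the full Stripe-style processor response, referrals, fulfillments, and so on).

```js
// generate.js -- a stripped-down sketch of the sales generator
var faker = require('faker');
var fs = require('fs');

var makeSale = function () {
  var price = parseFloat(faker.commerce.price());
  return {
    status: 'complete',
    customer: {name: faker.name.findName(), email: faker.internet.email()},
    billing_address: {
      street: faker.address.streetAddress(),
      city: faker.address.city(),
      country: faker.address.country()
    },
    items: [{sku: faker.random.uuid(), name: faker.commerce.productName(), price: price, quantity: 1}],
    total: price,
    created_at: faker.date.past().toISOString()
  };
};

var sales = [];
for (var i = 0; i < 100000; i++) { sales.push(makeSale()); }

// write to disk and let `rethinkdb import` do the heavy lifting
fs.writeFileSync('./drop/sales.json', JSON.stringify(sales));
```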
Okay, let's pull a document out and see what kind of information we have in here. This document represents a checkout, so it's actually a rather large document, and if you recall the big document approach, the process approach that I talked about in Thinking in Documents, then you'll recognize this. I've got a billing address, I've got a shipping address, I have all the items as part of the checkout itself. I have payment information, I have the processor response, and this is emulated, coming from Stripe, and again, all this data is anonymized, there's no real information in here. Faker has created fake data for us. But this is a very real Stripe response format. I have referral information that I've randomly dropped in there. I also have a source, and status set to complete, and I have some date information, so yeah, this is a big document. All right, let's do the next one, this is fulfillments. So when a user checks out, they create that checkout document, and then the system will create a fulfillment document, which means that our system, or maybe some administrative person somewhere, or a warehouse, who knows, has to fulfill that order. So, since I have 100,000 sales, correspondingly, I have 100,000 fulfillments. But you'll notice this import is going a lot faster. This is happening at about 2,000 to 2,300 writes per second. Well, that's because our fulfillment document, as you're about to see, is a lot smaller, so it writes a lot faster. In fact, it bears repeating that write speeds are influenced by so many factors. If I had thrown some secondary indexes on this table, it would be a lot slower. And correspondingly, if I had a bunch of shards and replicas, then it would be a whole lot slower. But anyway, the write went off pretty quickly, so let's take a look at one of our fulfillment documents, and I'll do the same thing that I did before. In here we have a little bit less information, but we do have information about the destination where the order fulfillment is going to go, and how many times it's been downloaded, because we're doing digital items here. We have the email of the customer, and we have an items list, but notice that the items list now is more about the download status, or the fulfillment status, of each item in that list. And we have notes, and in addition, I've dropped in notifications. This is the way I store data for orders. I want to make sure that whatever happens when fulfilling that order is tracked in a single document. This way, I can tell if emails were sent, perhaps some kind of messages back and forth, who knows? And any shipping information that goes along with it. Okay, there is our test dataset. We are going to now ramp up our querying to work with something that's just a bit more real world.
-
Secondary Indexes
When working with big datasets like the ones we just brought in, it's really important to set up secondary indexing. Secondary indexes are core to working with RethinkDB, so let's take a look at them now. So, let's say we wanted to query an invoice by customer email. Now, what a lot of people might do is they might just put email as one of the top level attributes on their document, but you certainly don't need to. What you can do instead is you can set up a secondary index. And to do this, I'll zoom in, so we can see this better. You just say indexCreate. In fact, here's a list of all the index things we can do. For now, I'll create an index, I'm going to give it the name of email, and then I'm going to tell it where to find it, and I'll use r.row, which is syntax you haven't seen just yet. r.row is basically a shortcut function that says, the current document's customer email attribute, or r.row, customer, email. I'll be working with this more later on. There's also a couple of ways I could have created this index. I could have specified this using an inline function if I wanted to. I'll show you that in just a bit. For now, let's create our new index, and if we scroll down here, we can see the result returned to us from RethinkDB, and hooray, it's been created, that's exciting. So how do we actually use this thing? Well, for a point of comparison, let's do a quick filter for a customer's email, and for that, I'll just use the filter function. Now, RethinkDB doesn't automatically see an index and say, a-ha, I will use that in this query. You actually have to specify it. What I'm doing here is I just want to create a comparison. So I want to filter by customer email, and I'm going to use .eq, and then use one of the emails that I have. And you can see, right here, there's the result for our invoice, or invoices, I should say, with the customer and their email, and you can see that it executed in 558 milliseconds, with 13 rows returned. So keep that in mind, it's just about half a second. So let's use that index, and to do this, we're going to use the special getAll. getAll is specific to working with secondary indexes, and here, I'll just specify the value and the index that getAll needs to use, in this case, email. All right, keep your eye on that, 558 milliseconds, and if I run it, boom, 10 milliseconds, that is pretty darn fast. That is the power of secondary indexes. If we go down here to the sales table, where I created the secondary index, you can see, it is now listed here as a secondary index on our table. When people see this kind of performance boost, well, they want to throw indexes all over the place, but you have to be careful. Because the more indexes you have on a table, well, the less write performance you're going to have. So, for a table like sales, not really a big deal. But, you know, you have trade-offs. That's all I'm trying to say, and you'll have to choose whatever works for you. You can also pass in additional values here. You can have as many keys as you want passed in. So here, in 96 milliseconds, we returned all the orders for the two emails that you see on screen. But getAll is not the only thing we can use, we can also use between, so we can retrieve all invoices with emails between Henderson and Ubaldo. I would imagine this is all the emails that start with H, I, J, K, and so on. While it might not be useful for emails, it is very useful for a number of other things, including numeric values and so on. Secondary indexes are useful for all kinds of things.
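Here's a rough sketch of those queries as you'd type them in the Data Explorer. The customer/email field path matches the documents described in this demo, and the email addresses are just placeholders:

    // create a secondary index named "email" on the customer's email field
    r.table('sales').indexCreate('email', r.row('customer')('email'))

    // slow: filter walks every document in the table
    r.table('sales').filter(r.row('customer')('email').eq('aaliyah@example.com'))

    // fast: getAll uses the secondary index
    r.table('sales').getAll('aaliyah@example.com', {index: 'email'})

    // multiple keys at once, and a range query with between
    r.table('sales').getAll('aaliyah@example.com', 'zula@example.com', {index: 'email'})
    r.table('sales').between('henderson', 'ubaldo', {index: 'email'})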
In fact, you can order by them. Here, I'm just going to say order by the index email, and you get very speedy sorts. And if we scan down the document here, just to be sure, yeah, there it is, it starts with Aaliyah. So that is obviously sorted correctly. If we wanted to do this in descending fashion, you use r.desc, that is how you do a descending sort. You have to tell RethinkDB to do it in a descending fashion. Scanning down the document to be sure, and yup, Zula_sauer44.
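As a sketch, ordering by that same index looks like this:

    // ascending sort using the secondary index
    r.table('sales').orderBy({index: 'email'})

    // descending sort
    r.table('sales').orderBy({index: r.desc('email')})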
-
Compound Indexes
Secondary indexes with RethinkDB are very powerful, as I hope you're starting to see, and you can have more than one on a table if you want. So let's take a look at a second type, which is compound indexes, indexes of two or more values. So for every sales document, I have referrals, and for the large part, they're null, but some aren't, so let's restrict on that, and I'll use has fields to do that. And if we take a look down the list here, we can see yes, there is a referral. So let's say I want to have an index where I can query quickly and easily on this referral identifier, but also, the state where the buyer is located. So for that, I can use the billing address state, in this case, it could be Pennsylvania. So for this, I might want to create an index that will index both values, the referral identifier, and the state of the customer. So I can do that, and I will create an index, and I'll call it referrals, and for the definition, I am going to just pass in an array. What you saw me do before was just a single row, customer and email. Here, I'm just going to do an array of basically the same thing. Billing address and state, and then for the second value, I'll drop this down so we can see it, for the second value, I'm going to pass an r.row, and this is going to be referral, and identifier. So, running it, and our index is created. So now, let's put together a query, using getAll. I can specify that I want Pennsylvania, and I also want the referral identifier 22222. Because this is a compound index, I'm going to have to do two things. First is I have to send this in as an array. Second, I have to make sure those values are in the same order that I specified in the index. And I'll just specify the index is referrals. Okay, executing our query, it comes back in fairly fast order. And having a look at our data, yup, we have a state of Pennsylvania, and let's check out the referral down below here, and there it is, that is the right identifier.
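A sketch of that compound index, with the field names guessed from the document structure described here (billing address state plus referral identifier):

    // compound index over two values: state first, then referral identifier
    r.table('sales').indexCreate('referrals', [
      r.row('billing_address')('state'),
      r.row('referral')('identifier')
    ])

    // query it: pass an array, in the same order the index was defined
    r.table('sales').getAll(['Pennsylvania', '22222'], {index: 'referrals'})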
-
Arbitrary Indexes
Sometimes, it makes more sense to just index values that are rather arbitrary, and let's take the case where we want to see the status of an order combined with a shipping company. We could do a compound index here, but for the sake of example, I want to show you a way you can actually just make up an arbitrary index out of arbitrary values. So to do this, once again, we will use indexCreate, and I'll give it a name, however, instead of just specifying r.row and telling it which row, I am going to actually use a callback function here, and I'll use F to signify an individual fulfillment. Here, I just need to return something, and what I'll do is I'll just say let's concatenate together, using r.add. We'll concatenate the values that are in the shipping company row, I'll add in a dash, and then I'll also throw in here the status. So r.add is going to concatenate those values together, and make a whole new value that wasn't even there before. So our index is created, and now if I do UPS-pending, and I specify the index as shipping-status, that's just a name that I use for convenience, there it is. 124 milliseconds, and that actually makes a fairly interesting and powerful query. You might be thinking, why is this even useful? Well, here's a couple of ways that it is. Let's say if you want a query on someone's first name and last name, or perhaps you want a query on the sale date, and you might want to query against year and quarter together, and look at all the sales between two years, and two different quarters. You could concatenate the year dash quarter together, and have an index that you can query against arbitrarily, rather than having to store those values individually. Either way, if you can't figure out how to index something that's on a table, you can always just make it up.
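Here's roughly what that looks like; the shipping_company and status field names are my guess at the fulfillment document's structure:

    // arbitrary index built from a computed value: "<shipping company>-<status>"
    r.table('fulfillments').indexCreate('shipping-status', function (f) {
      return f('shipping_company').add('-').add(f('status'))
    })

    // find everything pending with UPS
    r.table('fulfillments').getAll('UPS-pending', {index: 'shipping-status'})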
-
Big Mapping Query
Our boss has just asked us to go through these sales and fulfillments tables, and come up with a list of all the pending orders, sorted by company, and grouped by company. Oh, boy. Our sales documents and our fulfillment documents are quite large, and we'll want this in readable fashion. So, let's do some data shaping, or some mapping work. And to do this, we will abuse the mapping function. So, working with the sales table, the first thing I want to do is I want to pull out some information about the customer and the sale. I can use pluck for this, and I can pluck out the ID, and I can pluck out the email, but this gives me the customer's ID, I actually want the order ID, so I'll have to do something a bit more intelligent. So I'll use map for this, and what I'll do is I'll return sale_id, so it's nice and clear, and I'll specify r.row and then ID. For email, I will pull the embedded object email out, and put it right up next to the sale ID, so I'm basically shaping this data to look exactly as I want it to be. So if I run this, there it is. I have company, email, and sale ID, and I can use this for further querying, by just chaining some more functions on. But before I do that, let's talk about this r.row thing. r.row is an alias for the current document, and as you see down below here, I can ask for ID, customer, email, and so on. This is just how I'm using the map function. I am not using a callback, I'm just specifying the way I want to see the data using r.row. But I don't need to do it this way, I can also use a callback. Sometimes callbacks are a little bit more clear, so for instance, here I am just asking for the sale ID and the sale customer email, so basically, I'm using sale as a variable instead of r.row, which is an alias. Choosing between using an object as a return with map, and using r.row, versus using a callback like I am doing here, comes down to a number of things. The first is that if you're going to do additional work with a sale document, say you want to apply a filter or do some additional mapping, well then you're going to want to start off using the function as I am here, because you'll need to have a reference to that document. r.row, remember, just refers to the current document, and if you try and use it in any kind of subquery, it's just not going to work, because there are multiple documents in subqueries, and r.row will apply to the top level. Anyway, it's kind of like this, that's the easiest way to think of it. If you have something simple to do, you can just use r.row, as I am doing here. Otherwise, you might want to use a callback. Okay, in addition to the sale ID, let's also grab the customer ID, just in case we need it later on, and that looks good. All right, well, we have a little bit more work to do here. The next thing I want to do is an innerJoin, and this is going to join one table to the next, or one sequence of data to the next. Here I'm just going to grab the fulfillments table, and I'm going to pass a callback in, and since I'm using innerJoin, that callback is going to expect two arguments, it's going to expect table one and table two, and then I have to specify which IDs are joined. So here I'll just say the ID of the sales table is going to equal the sale ID of the fulfillments table. This is kind of a wordy way of joining things together. You have a table, and then you innerJoin a query, or another table if you want, and then it's going to be a function, the table, left, and then the query, right.
And then you have to join those things together manually. And now, I can use map to help put the data in the way I want to see it. But before I go further, what I'm joining here is not actually two tables. I am joining the mapped query above, so I'm going to rename this, to avoid confusion, to sale_summary. I'm also going to drop a limit on here, so I don't crash my server, or crash my client, with a ton of data coming back. All right, what I want to map out here is the status of the fulfillment, and for this, I can use r.row. In fact, I kind of have to, because what's given to me in this mapping function, straight out of this join, is actually a right document and a left document. So I have to say, I want the right side status, which is our fulfillments table, and I want the left side email and the left side company. This might look a little bit strange, but this left-right kind of thing is what you have to deal with, with joins, and also groupings. We'll see that more later on. All right, so if I run this, uh-oh, I get an error. Why am I getting this error? It's telling me that I don't have an ID in this object, and that is my mapped query that I started out with above, and of course I don't have an ID, what I have instead is a sale ID. So that's happening right here on this join, and I need to make sure that I change this to be sale ID in our join statement. Okay, let's run it again, and yeah, it's not the fastest thing. And this can be a little bit off-putting, but don't worry, we're going to deal with this in just a second. What's important is it looks like it works. 14 seconds, for three rows of data, ugh. Why do we have three rows? Because I put a limit on the bottom, here. What I think I should have done is put this after our first query, to limit the big fat result set up there. The location of limit is really important, as you can probably imagine, so I want to limit to the first three sales, and that actually improves things a lot, 1.8 seconds. But I know what you're thinking: I don't want to have to use limit everywhere, what if I want to query over the entire result set? As I mentioned, we will get to that soon, I promise. What we might want to do here is we might want to filter on a specific result set for a given company. Notice that I can just drop the filter statement in line right here, and specify that I want to see all the sales for a given company, and here they are. However, that filter is taking a while, in fact, it took almost 17 seconds for seven rows. All right, how are we going to fix this? Well, that's coming up next.
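Here's a rough sketch of the query we just built up, written with callbacks; the field paths are my reading of the documents in this demo:

    // shape the sale documents, then join them to fulfillments
    r.table('sales')
      .map(function (sale) {
        return {
          sale_id: sale('id'),
          customer_id: sale('customer')('id'),
          email: sale('customer')('email'),
          company: sale('customer')('company')
        }
      })
      .limit(3)   // limit early, before the join, or you'll pay for it
      .innerJoin(r.table('fulfillments'), function (summary, fulfillment) {
        return summary('sale_id').eq(fulfillment('sale_id'))
      })
      .map(function (row) {
        return {
          status: row('right')('status'),
          email: row('left')('email'),
          company: row('left')('company')
        }
      })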
-
Optimizing
Well, one of the neat things about working with real world data sets is that you have to tackle things like 16 second queries as part of your demo, which is good, because those are the things that happen in the real world. So the first thing I'm going to do is I'm going to create an index. But what do I create this index on? Well, a good rule of thumb is that if you're filtering on something, that should be what the index is on, so create an index on whatever you filter on. So in this case, I was filtering on the company name Stracke-Klocko, love that random data. Okay, so that has been created. The next thing I want to do is I want to create an index on the sale ID on the fulfillments table, because I'm joining on that, so this is step two, create an index on keys that you use in joins. Okay, so let's rewind a little bit, and go back to our query. And taking a look at this, we've got a lot of stuff going on here, including an inner join, which I don't think we need. Let's see if we can be smarter about this. And the first thing is that we should be using an eqJoin, that's the join I should have been using the whole time. It's specially designed to work with primary keys, and here, I'm telling it to use the ID of my sales table, and join it over to my fulfillments table. However, instead of specifying a key name here, I'm actually going to specify a secondary index to use. This is going to use the sale ID index from our fulfillments table, which is going to speed things up dramatically. Now, I didn't have to use that index, but it's always a good idea, if you're going to be doing queries like this, to create an index on what's supposed to be a foreign key. Okay, well now that I've done that, let's do our mapping once again. Here, I'm going to map the sale ID to the left side ID, that is going to be our sales table's ID. And then the fulfillment ID is going to be the right side's ID. The status, of course, is going to come from the right side table, which is our fulfillments table, and the company is going to come from the left side, which holds the customer data. All right, don't blink. Notice I don't have any filters on this. 169 milliseconds. There are no limits, there are no filters. This is an eqJoin on an index, returning the exact data that we need. Look at that! That is amazing. Now, if I put a filter on here, check out what happens. I'm filtering by the company Stark Group, but notice how everything slows down. In fact, it's going so slow, I'm just going to have to abort this, because what it's doing is it's trying to apply a filter over 100,000 rows of returned data. Now that is something to really keep in mind. I have optimized everything using keys, using joins, using indexes, but if I throw a filter down, it's going to have to go back and apply that filter all over the place. To get around this, instead of using the sales table directly, I'm going to use the secondary index company that I created. I'll just say, getAll Stark Group, and then I will join that on an eqJoin down below. Don't blink, 11 milliseconds. How is that for an optimization? I have to tell you, when putting this demo together, I literally screamed, and then I started laughing. I could not believe what I saw. Quite the optimization.
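A sketch of the optimized version, assuming the index names we just created (company on sales, sale_id on fulfillments) and the same field paths as before:

    // index the field we filter on, and the key we join on
    r.table('sales').indexCreate('company', r.row('customer')('company'))
    r.table('fulfillments').indexCreate('sale_id')

    // pull one company's sales via the index, then eqJoin to fulfillments
    r.table('sales')
      .getAll('Stark Group', {index: 'company'})
      .eqJoin('id', r.table('fulfillments'), {index: 'sale_id'})
      .map(function (row) {
        return {
          sale_id: row('left')('id'),
          fulfillment_id: row('right')('id'),
          status: row('right')('status'),
          company: row('left')('customer')('company')
        }
      })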
-
Analytics
Grouping
Now let's move into doing some advanced fun work with RethinkDB and we will do some analytics. One of the first things that we are going to need to learn how to do is how to group records together under some kind of grouping. So in the catalog table here, I have individual albums and each album has a vendor specified. So if I zoom in here, you can see that the vendor is the artist that recorded the album. So what I want to do next here is I want to be able to group everything by vendor. That's going to be one of the first steps, and I'll do that by grouping by vendor name. To do that in RethinkDB I use the group function and I'll specify using r.row. And here I'll just say it is going to be grouped by vendor name. Running this, you can see we have a pretty interesting result. It is not at all in the form that we are used to in writing queries. We have a group and we have a reduction. And the group is the vendor's name, AC/DC. The reduction, however, is each document grouped under the specified group. So here we have a reduction that's an array, and we have the details of each album, and if we cruise on down here, you can see that each album is represented for the given artist. We can take a look at this probably in a better way using table view. If we open up the reduction, you can see that each reduction is itself an array of objects. And each one of those objects represents an album. Let's take a look at what kind of thing we are working with, and we can use typeOf, and we can see that it is a grouped stream. And what is that exactly? Let's take a look. The best way to show this is by example. We are working with a stream. It is a stream of records, but that stream is grouped together. And the most effective way to show this is to use map, and here I'll use a callback. And when using the callback here, notice that I'm actually going to be working on each album. So even though you saw a group and reduction before, that is not the top level document. The top level document that we are working with here is the album. So let's just say that we are going to return the album name. And there we can still see our group, but in the reduction, we see the album name. So it is kind of an interesting way of looking at this data, but it is something that's really good to know: when you are working with a grouped stream, anything that you do to the data you are actually doing on the stream itself, and that is inside the grouping. And that makes it a little bit weird, but let's output the price here, the price for each album, and there you can see that everything is grouped as we would expect. Then let's output a count, because we might want to do a rollup on this. And that comes next. We can do a sum on top of the grouped stream and we can say we want to see the sum of the count. And then the reduction all of a sudden changes. We don't have any individual items anymore. Instead we did a sum on the items inside of the reduction. If we don't perform a reduction then the item itself will be returned. This is a little bit confusing the first time that you see this. So let's keep going with some more examples.
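Here's a rough sketch of that progression; I'm assuming the vendor's name lives at vendor.name and that each album has name and price fields:

    // group the catalog by artist
    r.table('catalog').group(r.row('vendor')('name'))

    // operations chained after group run against each group's documents
    r.table('catalog')
      .group(r.row('vendor')('name'))
      .map(function (album) {
        return {title: album('name'), price: album('price'), count: 1}
      })
      .sum('count')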
-
Ungrouping
We've grouped our albums together by artist and we've output pricing and count, but we want to do a little bit more. Let's take a further look at grouping with RethinkDB. And here is the output that I showed before. This is a grouped stream. We have a group and then we have a reduction that we can operate on. I showed you sum before, but I can also use something like average. For each group, show me the average of the price, and it will go in and do exactly that. You can add an aggregation function like average or sum or count or what-not to a grouped stream, which again is what we are working with here. And RethinkDB will be smart enough to understand that you actually want to apply that to the stream itself and not the grouping. We can get away from this, however, if we use ungroup. If we do ungroup, RethinkDB will take the results out and just say, okay, well, I'll turn this into an array. It is no longer a grouped stream. This might seem academic, but it's actually quite important, especially when you want to further operate on the data. Let's see an example of what I mean. So here's what I will do. I will back up here and let's just get the average of the price. And I'll use the aggregation routine and I'll just say average price, and out I get a group and a reduction. AC/DC, and the reduction is 1301. And then I'll apply ungroup. That means right now we are just working with a straight up array of data and we can do all kinds of interesting things with it, including mapping it. Here I'm going to use the mapping function and pass in a callback, and I'll want to work with each item and perhaps describe it in a little bit of a better way, so instead of group and reduction, I might want to return something a little bit more readable. Something like artist and average album price. So I can do that using map. Here I'll just set the artist key to the item's group and then average album price to the item's reduction. If I run this, that reads much better. This is one of the key use cases of ungroup. And you are going to see me use ungroup a lot.
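A sketch of that chain; the field names follow the catalog structure we've been using:

    // average price per artist, then ungroup so we can reshape the result
    r.table('catalog')
      .group(r.row('vendor')('name'))
      .avg('price')
      .ungroup()
      .map(function (item) {
        return {
          artist: item('group'),
          average_album_price: item('reduction')
        }
      })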
-
Map-Reduce
You've seen some interesting gymnastics with ReQL so far, but you are about to see more, and it very well could confuse you if you are not used to Map/Reduce and how it works. Even if you are a Map/Reduce expert, it might be worth watching this clip if only to see how it is implemented with RethinkDB. Straight off, Map/Reduce is all about taking advantage of horizontal scaling and multiple core processors. All Map/Reduce queries in RethinkDB are executed in parallel across shards using multiple CPU cores. This makes Map/Reduce tremendously powerful for analyzing massive sets of data, but what is Map/Reduce? If you, like me, are a little bit hazy on Map/Reduce and how it works, well, hang out for just a bit and we'll talk about it. Jumping over to Wikipedia you can read up on it, but the basic gist is this. Map/Reduce was developed by Google as a way to sift through data distributed across many systems. The idea is that you define the data that you want and then, in a second step, reduce it, bringing it all back together. Let's take a look at an example. Here I just want to find the number of albums that we have in our catalog. You've already seen me use the map function. So that should look familiar. Here I'm doing something that might look a bit strange. I'm just returning a one for each album in our catalog. If I ran this, you could imagine the output: a bunch of ones. 347 to be exact, as I already know the number of albums in our database. These are basically useless to me unless I decided to count each one of them on screen, which would be ridiculous. If I tack on a reduce function, however, things get easier for me. Reduce takes a callback, which expects a left and a right argument. More on that in just a second. I can now work on those 347 records, all of them at once, by adding them together. I do that by adding the left to the right using the add function you see here. What's left and right? When the reduction starts, the left value is the very first one and the right is the next one in the sequence. The reduction is working on top of a sequence in a recursive function, applying the addition calculation I specified. Once that sequence is completed, the left becomes a two and we move on from there. Or maybe the right is a two and the left is a one. This is an important thing to understand. The calculation is not linear from left to right. It happens in a distributed way. As mentioned, it can operate across your cluster in parallel. We are just working on a left side and a right side of the sequence. This is the definition from the RethinkDB documentation. The goal of a reduction is to squeeze a single value down from a map function based on the operation you defined in the reduce function. It is not just left to right. It happens on all sides at once, in many directions. One thing I haven't mentioned yet is the role of group. In other systems, Mongo, Hadoop, et cetera, you group your values together in the map step. RethinkDB has an explicit group function which does this for you, and you've seen me use this already. A quick side note, one thing I love in reading through the RethinkDB documentation are the illustrations from Anne Wright, the team's designer. It has a nice personal touch and I find that refreshing. Here is MongoDB's implementation of Map/Reduce, which is a bit different from RethinkDB's. The map function shapes the data by providing a key and a value, which you output using emit.
Map/Reduce with MongoDB used to be single threaded only, but in recent versions of Mongo, you can now use multiple threads to execute a Map/Reduce function. The reason I mention this is that execution in parallel is not evident with the API that Mongo gives you. RethinkDB embraces this concept by using left/right sequencing, and it's right there in the Map/Reduce call.
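Here's a minimal sketch of that count done the map-reduce way:

    // map each album to a 1, then reduce by adding left and right together
    r.table('catalog')
      .map(function (album) { return 1 })
      .reduce(function (left, right) { return left.add(right) })

    // the built-in shortcut does the same thing
    r.table('catalog').count()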
-
More Map-Reduce
Let's take a look at a common Map/Reduce function here. And what I want to do is I want to get some sales numbers for our sales table. So I'll zoom on in here and let's take a look. You remember in our document we have all kinds of information rolled up in items, and in items we have vendor information as well as sku information, price, quantity and so on. So these are the values that I will be working with. The first thing I want to do is I want to map the data, and I want the items here, and I have a bit of a problem. If I use just an item selection as I've done, it is going to return an array, the items array from the document. I can't really use that, but check this out: if I map over each item, and I ask for the item sku, let's say, and maybe the item price, what we get back is not very usable. We get an object back with price and sku keys, but each one of the values is an array. Not terribly useful, so instead of doing a selector here where I'm just asking for items, I'm going to use concatMap, and concatMap, as you recall, is going to map and concatenate the items array, returning each item as if it was its own document. That's exactly what I want. And then I can use a map function on top of that and I will isolate the sku, the price, and the quantity. In addition, I can do some calculations in line here. I can specify that the line total is going to be the item price, and then I can use .mul, which is multiply, and I'm going to multiply by the quantity. I could also use add or divide and so on, you get the idea, but I'm going to multiply this by the item quantity, and there you go. That is as I would expect. This is great because now that I've mapped this I can then do some roll-ups, but I want to do a little bit more here. I want to group all of this by sku, and in addition I'm going to limit it to 1,000 documents so I don't slow everything down. Okay, let's take a look at what I get back. This is again a grouped stream because I'm using the group keyword. I'm grouping by album and my reduction is an object. So I have all these numbers in there that I can then further reduce. And at this point, I should tell you that the example that you are about to see uses the long form reduction. In other words, I'm going to write it all by hand. RethinkDB ships with convenience reduction functions like count, sum, average and so on, that you can use right in line; just wrap one on the end of a grouped stream and it will do the reduction. But for this example I'm going to use reduce and I'm going to write it all out by hand. So reduce takes a callback with a left and right argument. As mentioned before, this is a reduction over a sequence, so we are going to take left values and do something with them as well as right. So let's play with our mapping above, and for the sku I want to return that value. For the total, I'm going to add the total from the left sequence to the right sequence. This is just a simple addition. And assuming I get my brackets right, running this, there it is. We have a reduction. So we have a group, which is album 10, and then we have our total, which is 1430. If I want to read this as dollars I can't, but I can divide it by 100, and look at that. Remember I said once before that we have numbers stored as doubles. So if you are worried about anything with floating points, just deal with integers, and that's what I'm doing here. I'm dealing with pennies. Okay, so we have some interesting numbers coming down. Now let's work with them a little bit more.
My next step here is to ungroup that grouping. And I want to map the results now. I want to show the sku, but I also want to show the count of sales as well as total sales. And I want those labels to make sense. So I'll put sku and I'll put sales total and, oops, there is no total in the object, and I love this. The reason I'm showing this to you is that this is a great error response. Basically it's saying, I tried to find this column in this object, and it wasn't there. And that's because I needed to say reduction and then total. There we go. That looks right. All right. Let's add a sales count to our map function way above, and that's just going to be a one. And then down below here I'm going to do a reduction on the sales count. I'll add the left sequence to the right to come up with a count total. Although what I really should be doing is adding the quantity. While this isn't a course in analytics, thank goodness, the whole point of me showing you this is just to show how I was able to chain things together and eventually tease out the result that I wanted. Shaping the data, running some mapping, doing a reduction, ungrouping and then mapping again. That is the essence of what it is like to work with RethinkDB. You can pretty much intuit the query that you want to write just by thinking about it, thinking, okay, if I tweak the data here, add this function here, and do this other thing over here, eventually you will get the data that you want.
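Here's a sketch of the whole chain written out with the long form reduce; the item field names (sku, price, quantity) come from the document structure described above, prices are treated as pennies, and quantity is used as the sales count:

    r.table('sales')
      .limit(1000)
      .concatMap(function (sale) { return sale('items') })
      .map(function (item) {
        return {
          sku: item('sku'),
          sales_count: item('quantity'),
          line_total: item('price').mul(item('quantity'))
        }
      })
      .group('sku')
      .reduce(function (left, right) {
        return {
          sku: left('sku'),
          sales_count: left('sales_count').add(right('sales_count')),
          line_total: left('line_total').add(right('line_total'))
        }
      })
      .ungroup()
      .map(function (row) {
        return {
          sku: row('group'),
          sales_count: row('reduction')('sales_count'),
          sales_total: row('reduction')('line_total').div(100)  // pennies to dollars
        }
      })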
-
A Co-occurrence Query
Running analytics with RethinkDB is a lot of fun, mostly because it feels like you are writing code. You are not writing database instructions or SQL or transformations in some business intelligence application. You are just composing a bunch of functions. Okay, for this demo we are going to do something a little bit more interesting. What I want to do is what's called a co-occurrence query. And the best way you can think of this is if you are shopping on Amazon and, let's say, you go to a product page and you look down below, right below the product fold, it says people who bought this also bought this other thing. That's called a co-occurrence query. So for any sku that's in my database, what are the other skus that have been sold along with it? It sounds like it could be easy, but it is actually not. So I'm going to go a little bit faster for this demo. The first thing we are going to do here is I'm going to limit the results to 5,000 because there's a lot of stuff I need to do. And then I'm going to do a concatMap and I'm going to bring up all of the items, because I'm going to be working mostly with the items. So this will bring each item up as if it's its own document. My next step here is to shape the data that I need. I don't want to bring in anything that I don't; it's just going to save me some processing power. So let's map, and I'll map each item that is returned to us and I'll just return the sku and a count of one. So right now all I'm interested in is just the count of skus in our orders. There we go, things are looking a bit better. Now let's group by ID. This is the critical piece: I want to group all of the albums sold by an order ID. That is going to give me the rollup that I want to see, because remember, none of these counts make any sense unless I do it in the context of an individual invoice. Okay, so now that I've done that, let's ungroup and then run a filter, and I need to ungroup to run a filter because I need to filter the actual array. I don't want to filter the group. That won't work. Here I now have all the invoices that have album 10 as one of the items. I now have all the data that I need. And this is good news. What I don't need is that invoice information. I don't need the grouping. So let's get rid of the grouping. I just want to work with the reduction. Since I used ungroup above, now I am just working with a simple object with two keys, group and reduction, reduction being an array. That means I can use concatMap with it. concatMap, to remind you, takes an embedded array and then elevates it up as if it was just a solid list of objects, and that's what I want to see, because I can roll up on this now and get some very good results, which is going to be my next step. Let's group again, this time by sku. Here we have another grouped stream. Now I can use one of RethinkDB's convenience reduction functions, it is a mouthful but I like it, and I can just sum up on the count. Man, we are almost there. Look at that. We have albums 101, duh-duh-duh-da, but this is sorted arbitrarily. This is not giving me the sort that I need. For that I'll need to ungroup again. And then what I'm going to want to do is ungroup and map. So in this case I'll want to return, almost there, the sku, and this is going to be the semi-final result, the group itself. So I'll want to see that, but then also I'll put a key in here, sold with, and this is going to reach into the reduction, which is that guy there. And it is going to give me that number.
So let's take a look at this. Good. Sku album 10, sold with. But we have album 10 in there, and I don't want album 10 in there. I want to remove album 10, so let's just return everything that's not album 10. I can do that with .ne, not equal, and this is going to return everything that's not album 10, because leaving it in is going to confuse my results. So now we have some pretty good results. Let's order this though, because we want to see it in top down order. So I'll say r.desc("soldwith") and we are done. Look at that. Album 248 is sold the most with album 10. Our co-occurrence query is complete. Now, you try and write this with SQL. It is not undoable, but it usually involves a lot of really strange joins. I'm not going to tell you that this was easy. However, the thing with RethinkDB, as I keep saying, is that as you start writing this and you get to know these things, it becomes rather simple to figure out what you need to do next. You know what concatMap does, you know what ungroup does, what filter does, what map does. You can start throwing these things together, chaining them together to get the results that you want to see.
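Here's a rough sketch of the whole co-occurrence chain. The sku value and field names are placeholders based on this demo's data, and I'm carrying the invoice id into each item document explicitly so the grouping step has something to group on, which is one way to get the effect described above:

    r.table('sales')
      .limit(5000)
      .concatMap(function (sale) {
        // one document per line item, tagged with its invoice id
        return sale('items').map(function (item) {
          return {invoice_id: sale('id'), sku: item('sku'), count: 1}
        })
      })
      .group('invoice_id')
      .ungroup()
      // keep only invoices that contain the album we care about
      .filter(function (order) {
        return order('reduction')('sku').contains('album-10')
      })
      // flatten back out to line items, dropping the album itself
      .concatMap(function (order) { return order('reduction') })
      .filter(function (item) { return item('sku').ne('album-10') })
      .group('sku')
      .sum('count')
      .ungroup()
      .map(function (row) {
        return {sku: row('group'), sold_with: row('reduction')}
      })
      .orderBy(r.desc('sold_with'))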
-
Validating
Good analytical systems are not only easy to use for querying your data and structuring some analytics, they are also good at helping you validate the data. To show you what I mean, let's put together another simple Map/Reduce query here and we will just do some sales counts. But what I want to do after that is sleuth the data to make sure that I'm right in my assumptions. And so let's do that straight away. Now, the first thing I'm going to do is what you've seen me do. I want to do a concatMap and I want to throw all the items from our sales table up so I can query against them. Here what I want to do is roll-ups on the vendor. And so what I'll do is I'll just do a concatMap and then a map, and I'll do a sku and I'll put sku here, and then the artist, and I'll use the vendor for that, and for the sales count, this time I'm going to do the right thing and use the quantity. So running this, good, you can see that our sales count is one or two or three, and we have a good sku and we have an artist. That's what I want to see. Next step, we will group by that artist. And so if I run this, it is going to put together a grouped stream, oops (laughs). I get that error again. That's a good error. If I just run a grouping and I don't specify some kind of reduction, I could blow up my browser. Let's not do that. So let's sum on the sale count and there we go. There is a good reduction. So AC/DC has sold 2600 albums. Great. So you can see we have 5091 and 2691. This again is all fake data. This is stuff that I was generating using Faker, and so I would expect to see sort of an even distribution here, it is all random. Let's check that by doing order by, and, have we seen this error before? We sure have. I'm trying to do an order by on a grouped stream. You can't really do that. You have to use ungroup, turn it into an array, and then we can use order by, and here we go. This is a rollup of sales per vendor. But it is not in descending order just yet. Let's put this into table view and run it again. There we go, alright. Iron Maiden. Look at that, 54,000 sales for Iron Maiden. The next one is Led Zeppelin at 36,000. How can that be right? It seems a little bit off, doesn't it, and this is where validating things comes in. Let's check the data to make sure it is right. If I put a filter in there, with the vendor being Iron Maiden, yeah, that's the number we expect to see. Let's take the grouping off and scan down, and the sales counts, they are all there. There are a lot of sales for Iron Maiden albums. Still doesn't make sense though. Why are there so many more sales for Iron Maiden? Let's do a count here, and 27,000 individual invoices for Iron Maiden. That's the number of orders. If we do a sum on sale count, however, that is the count of albums sold, and that looks extremely high. So how can we prove this? Let's go into our catalog and take a look around at the albums. Remember, I used a random record generator to generate all of these numbers, but what could cause this problem? Let's use a mapping function here, because what I want to do now is isolate the records that I want to sleuth. I can just quickly throw map down here and I can take a look at the artist, and I'll return the album vendor and then name. What I want to see is the title of each album next to it. I just want to see how many albums go along with each artist, and for now, I'm going to throw a filter down and I'm going to specify that the artist is Iron Maiden. Maybe there is something in the individual albums. Whoa, look at that.
I don't know who created the Chinook database but they are certainly Iron Maiden fans. There's a lot of Iron Maiden albums in there. Let's take a look at how many albums we have in our database per artist. And we can do this by using a count and then an ungroup, order by descending and then we will order by descending on the reduction. Look at that. Iron Maiden has almost twice as many as the next artist there, Led Zeppelin. 21 albums in our database. Now that would explain why we have so many sales because there is twice as many. If we use a random generator that's what's going to happen. It is going to be influenced by the number of albums. So cool, we were able to sleuth that and RethinkDB helped us.
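A sketch of the two queries behind this sanity check; the field names follow the demo's documents:

    // sales per artist, summed from each line item's quantity
    r.table('sales')
      .concatMap(function (sale) { return sale('items') })
      .map(function (item) {
        return {sku: item('sku'), artist: item('vendor'), sales_count: item('quantity')}
      })
      .group('artist')
      .sum('sales_count')
      .ungroup()
      .orderBy(r.desc('reduction'))

    // the sanity check: how many albums does each artist have in the catalog?
    r.table('catalog')
      .group(r.row('vendor')('name'))
      .count()
      .ungroup()
      .orderBy(r.desc('reduction'))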
-
Administration
Tagging, Renaming
Keeping your servers organized in a cluster, well, that can be a little bit difficult. RethinkDB gives you a way to rename and tag your servers as you need as your cluster grows. Here in the admin interface you can see a lot of information about our server Zayn, including the default tag in there. If we click on the server, you can come in here and you can see what records and tables it's responsible for. Here we have music catalog, music fulfillments, sales, and it holds the primary replica set for all of our tables. Well, let's head over to the data explorer, which you've seen me work in. And I want to change some of the information now, I want to change the tags specifically for our server. And I do that by opening up the server_config table in the rethinkdb database. Inside of here you can see information about all the servers, including the unique ID. We've seen this ID before; if you head on over to the console, when you start it up it says server ready, Zayn, and then it gives the ID. And that little GUID right there is the same one that we see right here. We can even query the server in the exact same way that we query data, and I can run an update if I want. Because right now the only tag I have in there is default. But what a lot of people like to do is add a tag for where the data center is. So if we're using something like Amazon, we might want to put in the data center that our server's in, like US West. Now notice if I put in crap data like blurgh, it'll stop me, it's saying, "Hey, you've got the data in a wrong format, it needs to be an array of strings." So it does check that, so I can't really screw it up. So I will leave the tags here as default and US West, and there they are. Well, having good naming and tagging for your servers might seem a little bit goofy, but as you're going to see later on, it can actually save your bacon. So let's refresh our server window here and there it is, US West, our server has been tagged. Good. We can also, if we want to, change the name of our server. So let's go back over here to the data explorer and I'll pull up the document that describes our server. And if I wanted to, I could update the server's name to Rob. I don't know, I don't know if I recommend you naming your server after me, but go ahead if you want. I'm going to change it back to Zayn. As tempting as it is, I think a server named after me is probably bad luck.
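A sketch of those updates against the system tables; the tag values here are just the ones used in this demo:

    // server metadata lives in the rethinkdb system database
    r.db('rethinkdb').table('server_config')

    // add a data center tag (tags must be an array of strings)
    r.db('rethinkdb').table('server_config')
      .filter({name: 'Zayn'})
      .update({tags: ['default', 'us_west']})

    // rename a server
    r.db('rethinkdb').table('server_config')
      .filter({name: 'Zayn'})
      .update({name: 'Rob'})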
-
Simple Sharding
Now let's take a look at one of RethinkDB's greatest strengths, and that is scaling out horizontally and how easy it is. So notice here that I have one server, one of one connected, and I've got three tables in there. And I can look down the list and it says one shard, one replica. That's all very important when you're working with a distributed database system. And here's all kinds of information about that table, the catalog table, including where the data is distributed across the shards, and since we have only one, we have one shard, one replica, everything is pretty straightforward and simple. But nothing ever stays that way, does it? Alright, well let's head over to our console, because what I want to do now is I want to scale out, I want to add another server, and I can do this on the same machine by simply offsetting the ports, and I'll show you how to do that right now. But first, what I want to do is I want to make a directory where the new database is going to keep its data, so I'll call it r2. Next, I'll tell RethinkDB to launch and I'm going to set the name of this server to Liam. And I'm going to tell it that I want it to offset its default ports by one. I'll show you what that means in just a second. And then I'm telling it that I want it to join localhost:29015. 29015 is the port that you need to connect on for intracluster connections, and that is displayed to you right at the startup of your server. So this is Zayn's startup and he's on 29015, and the client driver is on 28015, and then the admin interface is 8080. Since I told it to do a port offset, all of those ports are going to be off by one. In addition here with join, I've told it not to start up on its own, but to start up and join localhost:29015. Finally, I've tagged this as US East using the -t flag. Alright, we're ready to go, let's, whoops, start it up. Looks like I made a goofy error. If you don't specify where the data directory is, then RethinkDB won't start. That's important to understand, it won't overwrite another database's data. So to specify the data directory, I just use -d. And we are up, Liam is running, hooray. And it's running on the ports that I expect. In addition, it's connected to server Zayn, that's exciting. So, if we flip over to the web interface, it's already updated itself, look at that, two out of two servers are connected. I love how this admin interface updates itself, you don't need to come in here and refresh things all the time. I think that's really neat. Alright, heading over to the server window, you can see that Liam here has a default tag, which it needs. And then it also has a US East tag, that's good. So, we're pretending they're in two different data centers. Alright, well let's come down here and monkey with stuff. Right here I can now reset the replicas and the shards for my catalog table. I can just say I want two replicas per shard, hit reconfigure, and some writes go off. And just like that, it's done. Easy peasy, right through the web interface. If I come down here to the servers used by this table, you can see that a secondary replica of our data, the entire table, has been replicated over to Liam. This buys us some redundancy, we'll talk about that more in just a second. So let's head over to Liam, he's on localhost 8081, and you can see the log entries for Liam. And if we go over to the tables, you can see that we can actually query every single table. That's because Liam is part of the cluster. Liam's only duties right now are to carry the secondary replica of our catalog table.
But you can access the entire cluster dataset from Liam if you want to and here I am querying the catalog table but I could query any of the tables if I wanted to. We're going to get into all of this in just a second.
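Roughly, the commands for bringing up that second server look like this, using the long form of the flags described above; the directory and tag names are just the ones from this demo:

    mkdir r2
    rethinkdb --directory r2 --server-name Liam --port-offset 1 --join localhost:29015 --server-tag us_east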
-
Sharding and Replication
Let's take a deeper look at sharding and replication with RethinkDB. Understanding how these two things go together is critical when working with a distributed database system. Let's start with a simple scenario. You've got 100 gigs of sales data sitting in a table on your server. You've just imported it and you want to start working with it. The first important step is to understand what you're going to do with this table and its data. In our case these are sales records from the past. We want to query and analyze the records using ReQL's amazing map-reduce. And we don't want to wait for hours while the query runs. Conversely, if this were a log table, we would want to make sure that it's set up for a lot of writes, not necessarily a lot of queries. Finally, we have a third consideration with our products table. We want to be sure it's always available for reads from everywhere. It's how we sell things. I'll talk more about writing and availability in a bit, but for now let's focus on the first use case, analytics of our big sales table. We want to take advantage of RethinkDB's ability to execute queries in parallel. So, if we want to speed up our analytical runs, we can, by attaching more CPUs while simultaneously reducing the table load on each server. This is known as sharding our data: physically separating our table across multiple machines or servers. When you shard with RethinkDB you also create replicas, which are exactly as they sound, replications of our data across these shards. Each shard needs an authoritative replica, as there can be more than one. And in RethinkDB this is called the primary replica. In the above diagram we have four shards, each with a single primary replica that has exactly 25% of our data on each shard. Our analytics will run quite fast because we've divided our data out. But if there's a problem, we are dead in the water. We have just lost data because we only have one replica on one shard. Now, this might be okay for you, believe it or not. If you have an isolated data set that you can reload if a server explodes, you can have a setup like this. In the real world, however, you'll want to have some redundancy built in. Here we've told RethinkDB that we want an additional replica for each of the primaries across our four shards. RethinkDB will reach out to the other shards and intelligently replicate the data, essentially backing it up. Let's take shard C offline. And also destroy it completely, it's gone, it's never coming back. With C gone, any tables with primary replicas that were on C will be set to outdated reads, I'll talk more about that later. To repair this, we need to do two things. First, we need to locate C's secondary replica, which currently resides on shard A. This has all the data that C was serving. We then need to rebalance the table across the cluster, taking the data from the secondary replica on A to replace the lost data from the primary on C. Since we only have three servers, we need to rebalance down to three shards; you can't have more shards than servers. And good news for us, we can do all of this with a click of a few buttons, there's no need to actually remember anything of what I'm saying. RethinkDB is smart enough to know what to do. We just need to tell it to do it. Now it's important to understand that this is not an automatic process as it is with MongoDB. There is no automatic failover in RethinkDB, at least not yet. RethinkDB will tell you if there's a problem, but you will have to go in and manually reconfigure things.
Let's set up our catalog table in the configuration that I was just talking about. Here I've added two new servers to our boy band cluster. I've added Niall and Harry, and now we have four servers that we can create shards and replica sets with. Zayn is currently carrying the entire load, let's change that. And I do that by going over to tables and I'll click on the table that I want to change, catalog. And I'll say that I want to spread the data across four shards and I want to have two replicas per shard. That is the exact configuration I was just showing you in the slides. And we wait for just a second. And there we go, it spreads itself out and we are happy. Let's go over to our fulfillments table and do the same thing. And we wait for just a second, and it pushes the data across. I should tell you that for both fulfillments and sales, I've reduced the number of records in each one to 5,000 so this didn't take forever. Alright, so for each table, I have now sharded them four times and I also have a secondary replica on each shard, as you see here. So taking a look at shards one through four, you can see each one's got a primary, but that primary also has a replica somewhere else. This means if there's any kind of failure, I should be able to recover from it. Speaking of, I heard that Zayn wants to leave the band. How sad. Alright, so Zayn is now offline, that means we can't use the web interface on port 8080, we need to go over to 8081, good old dependable Liam, he's going to help us out. And right away we see that we have an issue, we'll talk more about these in a second. It's telling us that we need to remove a server, and so we do. Zayn has now left the band. And uh oh, our shards need primaries, we are now left with only secondary replicas for all the data that Zayn was carrying on his server. So what we need to do is go back now and reconfigure, and we have to do it manually. So I'm going to go in and specify that there are only three shards. And RethinkDB is smart enough to go and pull the data from the secondary replicas for Zayn and reassemble it for us into other secondary and primary replicas across the cluster. This is insanity. This is not an easy thing to do with any other system. And RethinkDB is, well, probably the only system I've ever seen handle this so elegantly. Alright, one more table left, we have to go reassemble fulfillments. I go in here to reconfigure and we set it to three shards, and this is going to rebalance the table across the three shards, pulling the data from the secondary replicas that it needs. And we now have no more issues, and our cluster is, well, less one member, but at least it works. Now let's consider the case where we might want the highest availability as well as the highest redundancy. We want to be sure that if something bad happens. (laughing) These animations just kill me. We can still read and write to our cluster, sort of. This is where things get a bit more complicated. With this many shards and replicas, it's trivial to boot server B out of the cluster and rebalance across A, C and D. If there were any primary replicas on B, however, that table now has a status of outdated reads. And this basically means the table is useless. Now if B was acting only as a secondary for other tables, well, they can continue on without a problem, they'll just kind of mention, "Hey wait, one of my secondaries is missing, but it doesn't really matter." Let's see that in action now. Good news, the band is back together.
All four servers are set up again as they were before, but this time Liam is carrying all the primaries. Let's reconfigure our sales table here, and this time I want to set it up so that it is spread across four shards and that it has four replicas per shard. This might seem a little bit paranoid, but I'm paranoid, I like to make sure all my data is backed up and replicated. And there we go, we'll just wait for a second and good. So now that that is refreshed, let's do a quick review. Our catalog for this configuration is only spread across two shards. But I've specified that I want four replicas per shard. Same with fulfillments, two shards and then four replicas per shard. This is going to make sure that I have the best failover I can possibly have. Speaking of failover, let's have some fun. You can see on the server window the way the primaries and secondaries are distributed across our little boy band cluster. But history's about to repeat, Zayn wants to leave the band. So, let's hit Control + C, Zayn has shut down. So, just as before, when we flip over, well, we don't have a console anymore to play in. We're going to have to flip over to Liam, and we do that by going to 8081. And there is good old trusty Liam, where he's always been. But we have some issues to resolve. And this time we're going to have some different characteristics that we're going to need to deal with, because I haven't spread all my tables across my cluster. I've only spread one, in other words, I've only sharded one across the cluster, the other two only have two shards. This availability and redundancy roll-up here kind of explains it. As far as availability goes, only one of three tables is affected. One of eight primaries is unavailable. That would be my sales table. Now how do I know this? Because in the beginning, if you recall, before I reconfigured everything, I mentioned that the band was back together but the primaries for all of my tables lived on Liam. When I sharded my catalog table and my fulfillments table, I made sure that they didn't use Zayn for a primary replica. It's because I don't trust Zayn, Zayn likes to leave. So, it turns out that that was a good choice, because Zayn has gone down and I can't get him to come back. It looks like Zayn is gone for good this time. Well, let's take a look at the availability of our tables now. This is interesting, our catalog table is completely ready. Some of the replicas are missing, but they're only secondaries, so it doesn't matter. Primaries for our catalog table are located on Liam and Harry, and those two shards are just fine. That means I can read, as you can see here, I can run queries, and I can also write to the table still, even though one of the servers in our cluster is down. Here's the configuration that I have for my catalog table, and you can see the primary replica is located on Liam and another primary is located on Harry. Zayn only carries secondaries, 'cause I don't trust him, as I said. Alright, let's go back now to our fulfillments table and it's the same deal, it's still ready, and that's good. Because once again, if we take a look at the configuration down below, Liam and Niall are the ones that are carrying the primary replicas, and again Zayn is only carrying secondaries. That means that I can still write to this table if I want to and I can do queries. However, I have outdated reads as my status for our sales table, and that's not good. Well, why is that?
Because I've sharded it across all four servers, and the primary that lives on shard two, right here on Zayn, is missing. That means if I try and query it, I'm not going to be able to; I get an outdated reads status, which basically means the table can't serve normal reads or writes. Here I'm just trying to query it in the data explorer, and you can see the error I get down below, basically saying the primary replica on one of the servers, some hex-named server, is offline. So, no data for you. Well, this time Zayn is gone for good, so let's remove him permanently. I thought I did this before, but I don't think the band's going to let him back in this time. So now that I've done that, I have the same issues that I had before, although this time there's only one that I have to resolve. We have three of three servers connected, and two of my three tables are ready and just fine. But one shard needs a primary, and that's on sales. So I just need to go in and reconfigure sales and tell it that there are only three shards available to it now. And you just saw me do this, but just for fun, let's go do it again. I'll come down over here, sharding and replication, and click reconfigure. I'll then reset things, as you've seen me do before, specifying only three shards. And I'll keep this at three replicas per shard. So the table rebalances itself, which is good. RethinkDB has found the data in the secondaries on the other servers and has rebuilt things. Zayn has left for good, our issues are resolved, and I say we move on. The final consideration here is writing data to our cluster. If we have a distributed system, physics will take hold and we'll have to wait for the new data to be propagated, pushed through the shards, and then finally written to disk. In this setup, we only have a single replica per shard, the primary, so our writes will be relatively fast. A shard will be chosen for the data, the data gets written, and we're good to go. RethinkDB is durable by default. This means it won't acknowledge your insert or your update until the data actually hits the disk. With a single server this isn't a big deal, but with multiple shards, it can really slow things down. Here in our fully available and redundant setup, the data is written to one shard and then has to be replicated across the entire cluster. It's not the fastest choice, of course. The good thing for us, however, is that if the data isn't critical, meaning we can tolerate some minor data loss if a server blows up, maybe it's log files or 140-character posts from grumpy developers, we can switch off the hard durability and move to an eventually consistent model. This means that when a write comes in, we'll trust our servers to stay up and the write will eventually propagate. Data loss during a soft write in this setup is actually quite unlikely. If one of the shards goes offline, the write will wait until it comes back online, queued inside the cluster. This is how we stem the scariness of eventual consistency: by having a cluster that is fault tolerant. If the server that's handling the initial write goes down and melts before the write's propagated, then yes, you will lose that grumpy developer's tweet. Usually, however, when servers go down there are ways to bring them back up. Let's take a look at that. We're back here with our volatile boy band cluster. And as you can see, I've got four servers again. They let him back in, I don't know why. The servers this time have been tagged with different server tags: we have US East, Asia, US West and Europe.
We've distributed the servers across the globe; maybe they'll get along better that way, who knows. But the one bad thing is that Zayn is carrying the three primaries, and Zayn has proven to be, well, not terribly dependable. So our DBA has looked at this and said, "Get our data off of Zayn." And I can totally understand why. I just can't go into it anymore. Okay, so what we need to do is move the primary replicas for our tables to Liam, and Liam resides in Europe. To do this, you would think it would be as easy as pressing a button in the web interface, but you have to do it through ReQL, which to me is better than using a web interface anyway. The way we do this is by using the reconfigure function for a table. In here we specify shards, and we can also specify replicas and which one is going to be the primary, which is lovely. So I'm going to specify that we have one shard, and then for the replicas I'm going to pass in an object saying that the server tag eur, for Europe, should carry the one replica. And then I'm going to specify that the primary replica tag is also eur. Not so sure about this API, but that's the way it works. Running this function is going to relocate the primary replica off of Zayn, and as you can see in the result set down here, it's now on Liam. That'll make our DBA feel a whole lot better. So let's go over to our servers here, and if I refresh, you can see that Liam now carries one primary and Zayn only carries two. We can go back over to our data explorer and do the same thing for catalog. And when we run this, it takes just a second, but it's been copied over to Liam as well. And then let's do this for fullfillments; this one has a little bit more data to it. But we run this and boom, over it goes to Liam. Checking our server list here and refreshing, yep: Zayn has zero primaries, Liam has three. A relatively easy way of dealing with a very problematic server that enjoys going down and leaving. When working with a distributed system, it's always nice to know where the system you've chosen sits with respect to the CAP theorem. It's a heady title, but basically it means you can't have all three things at once: consistency, availability, and partition (fault) tolerance. We've just seen three scenarios which cover parts of CAP, but we had to make choices for each scenario. We can't have blazing write speeds, consistency, bulletproof availability, and full redundancy at the same time. Some systems like Cassandra and Riak embrace this and lean more towards availability and partition tolerance. Twitter and Facebook match this use case quite well; those businesses won't go under because I've lost my grumpy post. RethinkDB, however, leans towards consistency, giving up some availability when things fail. This quote is from their documentation, which clearly states their position. You can, however, tweak things so that RethinkDB acts a bit more like Cassandra; that's up to you.
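For reference, here's roughly what that tag-based reconfigure looks like in the data explorer, along with the soft-durability option I just mentioned. Treat it as a sketch: the eur tag and the table names come from this demo, and the insert document is made up.

```js
// Collapse sales to a single shard whose one replica (and therefore
// its primary) lives on a server tagged 'eur'.
r.db('music').table('sales').reconfigure({
  shards: 1,
  replicas: {eur: 1},
  primaryReplicaTag: 'eur'
})

// Trade a little safety for speed on a per-write basis: with soft
// durability the write is acknowledged once it's in memory, before
// it has been flushed to disk.
r.db('music').table('sales').insert(
  {post: 'a grumpy 140-character post'},  // made-up document
  {durability: 'soft'}
)
```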
-
Backups
Backing up the data from your RethinkDB cluster is a pretty straightforward affair. In fact, they give you a command for that right in the rethinkdb command-line tool. You can see this if you run rethinkdb help. In here you can see export and dump. Export just drops the data to disk, for moving it into another database, and dump creates a backup with the full metadata and so on. So here I'm just dumping everything out, and it was pretty quick. One of the nice things it does is zip the export directory for you. Speaking of what's in here: if I open this up and take a look inside of our dump file, we have music, which is the name of our database. And inside of here we just have straight-up JSON, one JSON file for each table. In addition, we have some .info files for the metadata. You'll probably want that backup to go off on a semi-regular basis, and since we're using UNIX-based systems, the best tool for us is cron. And here we are: using crontab, you add an entry with a schedule and a command, and it goes off on a timed basis. If you're a crontab expert, then go for it. If you're like me and you don't know it so well, there are tools out there like node-crontab. This isn't really using the system's cron table to run; it just runs in the background. You declare a cron job and give it some code to execute on some sort of cron schedule, and it uses the exact same notation as the crontab does. So you just give it a function and it executes it. Now, I was inspired by that and also by this project here, Node MongoDB S3 Backup. Long name, but what it does is use Node once again, with the node-crontab module I just showed you, to back up your MongoDB database. All you have to do is fill out the config file; it runs, crumples everything up from the Mongo dump, and sends it up to Amazon's S3 simple file storage. When I saw this, I got kind of jealous. So I forked it, and I actually stopped recording and rolled the whole thing over to work with RethinkDB. So, there you go: if you want to back up your RethinkDB on a nightly basis, you can just use this package right here, rethinkdb_nightly. I published it as a Node module that you can install with NPM. Just read the instructions and it'll show you how to set up a nightly backup to Amazon's S3 service. And again, this was based on the work from Swift and his Node Mongo S3 backup utility that I just showed you.
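If you'd rather script the schedule yourself instead of using rethinkdb_nightly, here's a rough sketch of the idea in node. I'm using the cron package from npm as a stand-in for the libraries above, and the backup path is made up, so adjust to taste:

```js
// nightly-backup.js: a sketch of scheduling `rethinkdb dump`
var CronJob = require('cron').CronJob;
var exec = require('child_process').exec;

// Standard crontab notation: run every night at 2am.
new CronJob('0 2 * * *', function () {
  var file = '/backups/rethinkdb-' + Date.now() + '.tar.gz';
  exec('rethinkdb dump -c localhost:28015 -f ' + file, function (err, stdout) {
    if (err) { return console.error('backup failed', err); }
    console.log('backup written to', file);
  });
}, null, true); // the trailing true starts the job immediately
```

From there, pushing the resulting archive up to S3 (or anywhere else) is just another step inside that callback, which is essentially all rethinkdb_nightly does for you.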
-
Compose.io
Running your own server can be a bit of a drag, especially if you have to manage a big cluster, run backups, and make sure everything keeps going. If you're a small business, you might want to go with a hosted solution, so that's what I'm showing you now: compose.io. It used to be MongoHQ, but they've now broadened things out. They host MongoDB, Elasticsearch, Redis, PostgreSQL, and RethinkDB. Hosted RethinkDB is up on Amazon; that's where the data center is. So basically it's a slick interface on having your RethinkDB server running up on AWS. It's a subscription service, and they price it by data size. That's actually pretty good: when you consider how big your database can get if you're a successful business, it's well worth paying $75 to $150 a month to house that much data. You can also choose where you want it to be. You can see EU West and US East; those are the only two choices you get, but, well, that's just the deal for now with RethinkDB, it is still in beta, and I have a feeling they'll probably broaden the choices out as soon as they exit beta. Alright, so I'm creating a deployment here, and it takes just a minute as they do their thing, and it's kind of fun, you can sit here and read these little messages (laughing) that they give you down at the bottom. It took about two or three minutes, so let's just skip ahead. It would have been fun to read them all, but I want to show you what's going on here. Looking through this console back here, you have an overview, and it tells you how to connect, how to add a user, where your server is, and your connection information. I'll talk about each one of those things in just a minute. But if we scroll down here, you can see our deployment topology, in other words, what your servers look like and how things are all set up. And you can see each one of our servers is healthy, it's got an IP address, and there's a name for each one of the servers, including a proxy, which is quite nice. So, there's our topology. If we want, we can see our backups right here; there are no backups yet, but they do run one nightly for you, and they also support on-demand backups if you want all your data right this very second. There's a lot to this service, including a Mongo importer, but what I want to do right now is log in and take a look at our server setup. So, I'll take what they gave me here, an SSH tunnel, and I'll try to log in. I'm challenged for the authenticity of the host, I say yes, and boom, permission denied. That's an important thing to see here: it's denying permission to access the server directly. So what we want to do instead is add a user, and you do this by adding an SSH key, just like you do at GitHub. To get at your SSH key, if you're running on a Mac, you can just use cat to display it to the screen; it should be living in ~/.ssh/id_rsa.pub, and yes, I have blurred this out, so you don't get to see it, sorry. I'll paste it in here, and when I hit save, I just have to wait a second or so while it creates the user on each one of the machines. Good, now I should be able to SSH into the boxes. What I want to do, though, is actually see the admin UI, and it gives you some SSH tunnel settings right here. Now for some reason, I don't know why, tunneling in on the IP addresses they gave me didn't seem to work. So what I did is reset this to 127.0.0.1:8080, as you see here; otherwise I got a weird error.
Anyway, now that I've got the SSH tunnel up, I can go to localhost:8080 and there it is: I am browsing the servers that compose.io set up for me. And from here I can do all the administration tasks and querying and whatnot that we have been doing up to now in this course.
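For the record, the tunnel I ended up using was along these lines; the user, host, and SSH port are placeholders for whatever your Compose overview page shows you:

```
# Forward local port 8080 to the admin UI on the remote box,
# binding to 127.0.0.1:8080 on the far end as described above.
ssh -N -L 8080:127.0.0.1:8080 <user>@<your-compose-host> -p <ssh-port>
```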
-
Tips and Techniques
ORMs
Well, if you read my blog or have seen any of my courses here at Pluralsight, you know I'm not a big fan of ORMs and data access tools, but let's take a look at some of the better ones for RethinkDB. And I want to start over here at RethinkDB.com in the docs, where you can see that there are so many resources and drivers that you can play with. The officially supported drivers are Python, Ruby, and JavaScript, but there's a whole lot more, including C# and .NET. If we click over to integrations, which I have open over here, this is docs, frameworks and libraries. You can see all of the extensions and drivers written by the community. Here are some ORMs; we'll take a look at those. There are Python libraries, and there are ways to integrate with other tools, like RabbitMQ and Flask. It's a pretty comprehensive list of tools and other things supporting RethinkDB, so have a look through here if you have a moment. But for now, let's take a look at some of the ORMs that you can use, 'cause I know a lot of people don't like working with drivers directly. There are a bunch of different kinds out there, including rethinkdbdash. This is an interesting project, because it is, basically, a more helpful driver, if you will. It wraps the core rethinkdb driver with just a little bit of elegance, and this is for use with node, by the way. I should mention that everything I'm going to show you right now is using RethinkDB with node. So, yeah, this one actually returns all cursors coerced to arrays by default. I'm not sure why that's a good thing, but that's what they want to do. In addition, there is rql-promise. If you like working with promises, then you can use this light bit of abstraction to run some queries and get a promise back. I'm not a promise fan myself, but, ehh, that's neither here nor there. Alright, moving on, we also have rethinkdb-co. This is for use with ECMAScript 6, so if you want to take advantage of some of the new stuff that's coming out, especially Koa, which is billed as the next generation of Express, and you want to use generators, well, there you are, you can use that. There's also reheat, which is a full-blown ORM, and I think this is the one that people talk about the most. Now, I haven't used it, but it's got a ton of convenience methods that might make your life a little bit easier, especially if you like ORMs. You can see here how the connection is set, and then you declare your model. You can do various operations with the model, and it even supports promises. So, a pretty common ORM interface. Then there's thinky. Thinky works almost exactly the same way, but it supports associations, as you can see there. So, let's take a look at a tool that's not an ORM. It's called Chateau, and you can install it using npm install -g chateau. What this is, is a data explorer. In here, I have it running at localhost:3000, but if you want to just explore your data, and click through and see things, you can, and it puts it out in a tabular format. The convenience this gives you is that you don't have to sit there and write out all your data queries by hand; you can just type things into a form interface here and work with your data directly. On the site, it says that it is heavily alpha, so you have been warned.
In addition, it's got another neat feature here where you can import data directly through the interface, rather than going through and using the import tool. An interesting project, if you like exploring data through an interface. Alright, well, the last thing I want to show you is something that I created, and it's called second-thought. As I mentioned, I am not an ORM fan; I prefer just lightweight abstraction wrappers, something to make my life easier. And here it is. I wrote this about two years ago, when I was using RethinkDB for a project, and all it does is abstract things a little bit for you to keep things simple. One thing it does is create a table object and stick it onto a db namespace. Right there, that is actually a ReQL table that you can query against. If you want, you can use eqJoin and toArray and all those good things with it directly; it's just convenience stuff. The other thing it allows you to do is query using convenience methods like first, and exists, and destroy, and so on. And it allows you to create database tables by passing in a string array. So, that's there for you, if you don't want to work directly with the RethinkDB driver.
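Just to make the rethinkdbdash point concrete, a minimal sketch of using it looks something like this. The music database and catalog table are from this course; the artist field is purely an assumption for the sake of the example:

```js
// rethinkdbdash manages a connection pool for you, so there's no
// explicit connect() and no connection object to pass to run().
var r = require('rethinkdbdash')({db: 'music'});

r.table('catalog')
  .filter({artist: 'AC/DC'})   // hypothetical field, just for illustration
  .run()                       // returns a promise...
  .then(function (albums) {
    // ...and cursors are coerced to arrays by default.
    console.log(albums.length, 'albums found');
  })
  .catch(console.error);
```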
-
Change feeds
If you're a fan of Pub/Sub and pushing data down to clients from your database, well, this will excite you a little bit. Let's work with change feeds inside of RethinkDB. This is a new feature that was introduced last year. To do this, I'm going to create a table called monkeys, and, yep, that's it. I don't plan on doing much with this table, short of putting some data in and pushing things around. So, that's where I'll start. Let's insert a document in here, and I'll just put in a name for our monkey, Billy. And in goes Billy. The next thing I'm going to do here is watch changes on the table, just by invoking .changes. And now, look at what popped up in our interface: listening for events. So, I've split my browser window here, and I have two data explorers open. On the right side, let's insert a brand new monkey, and this one I'll call Joey. The left side is listening for changes, and boom, as I save it, well, this is like every Pub/Sub demo that you've seen. However, I do want to point out that this is built into the database and its drivers, so let's do this. Let's get that monkey out, and then I'm going to update the name and set it to Larry. And so, once again, the interface on the left side refreshes and you can see that the new value is Larry and the old value is Joey. The interesting thing about this is that if we switch over to node now, I can use changes in there, just like you saw me do in the web interface. It takes a callback, and that callback hands you a cursor. So, what I'll do is loop through that cursor and output each value using console.log. Let's run this. I'll just run node changes.js, and it's going to sit there and you're not going to see anything happening. That's because the connection is live and open. However, since I have changes sitting there waiting on my monkeys table, when I make a change from the web interface, it gets pushed out to my node client, just as you would expect. And I think that's pretty neat. That's a very useful feature. In fact, it's going to come in very handy, in just a minute.
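Here's roughly what that changes.js looks like with the official rethinkdb driver. It's a sketch; I'm assuming the monkeys table lives in the default test database:

```js
// changes.js: listen for changes on the monkeys table
var r = require('rethinkdb');

r.connect({host: 'localhost', port: 28015, db: 'test'}, function (err, conn) {
  if (err) throw err;

  r.table('monkeys').changes().run(conn, function (err, cursor) {
    if (err) throw err;

    // Each change arrives as an object with new_val and old_val.
    cursor.each(function (err, change) {
      if (err) throw err;
      console.log(change);
    });
  });
});
```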
-
Full Text Search
One thing that a lot of applications need is an intelligent, full text searching capability. You've already seen me use the match function on top of filter, and to remind you, match just uses regular expressions. So, here, I can create a filter command, use the callback, return the album name, and then use .match and pass in some kind of regex. Now, if I use lowercase devil dot star, well, we had this problem before. Remember, it's case sensitive. I can actually get around this problem of case sensitivity just by remembering a little bit of regex. What I can do to switch off the case sensitivity of this query, and I'll cruise in here so you can see it, is to put in an expression here, question mark and then i in parentheses, which says make this case insensitive. Well, that's interesting, but it's not really full text index searching. For that, you're pretty much going to want to switch over to a dedicated full text search server, and you can't get much better than Elasticsearch. You could also go with a search service like Swiftype. We're going to come back to Elasticsearch in just a second, but I do want to mention Swiftype. This is a great tool, because it comes with analytics and it tells you what people are searching on, on your site. It even has a specially tuned offering for ecommerce. So, with the drop of a JavaScript snippet, it hooks up to their service and scans your site, and for a lot of customers, that's all they really care about: what they can see on your pages. And a lot of times, that corresponds directly with products. It is a paid service, however, and I know a lot of folks who'd rather go with something like Elasticsearch. So, if you want to use that, there is a section here in the documentation that tells you everything about how to set it up, and that's what I'm going to step through here and show you. I've already installed Elasticsearch on my machine. If I go to localhost:9200, you can see that I have the output here from the Elasticsearch server. And I am also running Zayn right next to it. So we have our RethinkDB cluster running, we have Elasticsearch running, so now I'm just going to step right through and install the plugin, the river plugin, that hooks up RethinkDB to Elasticsearch. This is provided by the RethinkDB team. It can take a while to download, especially if you're on really slow wifi in Italy, like I am. But anyway, once it gets installed, I'm just going to restart my Elasticsearch service. And there we go, it is up and running. Okay. Now, the next thing I need to do is tell Elasticsearch to use this plugin, and I need to configure it a little bit and tell it which ports and database to look at in RethinkDB. That's the next part of the tutorial here; it tells you exactly how to do that. It's simply a curl -XPUT up to your Elasticsearch service, and you send in a payload, as you see here. Type is rethinkdb, et cetera, et cetera, et cetera, and then the databases. Well, I have already formatted this, so I can just paste it in here. I'm telling it to, again, use our plugin, that our database's name is music, and that it should look at the catalog table. And finally, the host is localhost, port 28015. And, sending that off, it was created. That's good. So, let's go over here and use a Chrome plugin called Sense, and this is going to allow us to browse all the data in our Elasticsearch instance that's running.
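Quick aside before we dig into Sense: here's roughly what that case-insensitive match filter from the top of this clip looks like in the data explorer, assuming the album name lives in a field called name:

```js
// (?i) switches the RE2 regex to case-insensitive matching,
// so this catches Devil, DEVIL, devil, and so on.
r.db('music').table('catalog').filter(function (album) {
  return album('name').match('(?i)devil.*');
})
```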
Back in Sense, if I run this query, which just means bring me back everything, and scan on down here, hey, there's our music catalog. It's the music index, the type is catalog, and if I scroll down here, we can see all of the albums. That's good, but I don't know that I want to scan over all of those things; I'll handle that later on. For now, let's just query this and see if it works. So, I am just going to send in a straight-up string over the entire index, and I'll use devil, as I've been doing before. And good, there are the hits. We have a total of five hits, and it returns the documents that I expect. Good. Alright, the next task here is to make sure that our application can talk to Elasticsearch, so I want to be sure I install the node bits for Elasticsearch, and you do that using npm install elasticsearch, and then you can use it right inside your application. Then I can just copy and paste this code right out of the Elasticsearch documentation. What I want to do is run it inside of our app to make sure it works. So, just copy and paste it down. The index is going to be music and the type is going to be catalog. And what I'll do is mirror the exact query that I did in the Sense interface, and I'll just query on devil, like I did. So, use query_string and then query. Let's flip back over into our terminal, and I'll run this using node search.js, and boom, there are all the hits we expect to see. This is great. Our catalog is being synchronized directly into Elasticsearch, and we can query it from our application.
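And here's roughly what that search.js ends up looking like with the elasticsearch npm client. The index and type are the ones the river created; the query mirrors the one from Sense:

```js
// search.js: query the music index from node
var elasticsearch = require('elasticsearch');

var client = new elasticsearch.Client({host: 'localhost:9200'});

client.search({
  index: 'music',
  type: 'catalog',
  body: {
    query: {
      query_string: {query: 'devil'}
    }
  }
}, function (err, response) {
  if (err) { return console.error(err); }
  console.log(response.hits.total, 'hits');
  console.log(response.hits.hits);
});
```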
-
Full Text Search, Part 2
I think the plugin provided by the RethinkDB people for plugging into Elasticsearch is pretty neat, but I like to have a little bit more control over what's being searched and how. So, here in my project, I've added a little bit of abstraction. I've created a db module, and inside of it I've abstracted things away a little bit; I have kind of a repository, if you will. You can see I have allProducts, and then I'm also returning products and sales back out as r.table objects. That'll make querying just a little bit easier. To use this thing, I just have to new up a brand new instance and pass in the database that we want to use. Then I can say db.allProducts and pass in a callback, and what I'll do is flash the result to the screen, so you can see that it works. Alright, simple enough. Flipping back over to the console, this probably is not going to be too much of a surprise; I actually do practice these demos. And there it is. It works. Alright, so we know that our little instance works, and we know that we can now have some abstraction. The next thing I might want to do here, since I have db.products set up as an r.table, is monitor changes on our products table, and I can do that by saying changes, and then .run, and passing in the connection. Then, once the changes go off, I can roll over each one of them, and, can you guess what I might want to do with each changed product? Yep, I want to dump it out to Elasticsearch. So, let's require the elasticsearch driver here, and I'll connect to it right here at the very top, creating a new Elasticsearch client. Down below, I can just say es.index. This will add a document to an index if it's not there, or overwrite it if it is. I'll specify that this time the index is music, but the type is product; I'll just separate that from what the river is integrating in the other example. And I'll set the body to our product. Then I'll output the response, so I can see what happens whenever a change goes off. Alright, well, that looks simple enough. I'm just listening for changes, and then I'm going to push them over to Elasticsearch, but I really don't like having this code all inline right here. So, let's just cut it and put it into its own function above. I'll call this syncSearch, and I'll paste all the code in. It's just because I'm feeling very OCD about this. Okay, so I'll call it here, whenever a product changes. Alright, let's stop and restart things to make sure everything goes off okay, and good, there's Elasticsearch. I can see it started running, and it's listening. Excellent. So, let's test it out and see if we can actually change a product. I'll just get number 13, and I'll update it by adding an updated at timestamp. This will trigger a change on the document, and hopefully everything will go off as we'd expect. Okay, let's flip back over to the terminal and re-run our page, and great, look at that. Elasticsearch says request complete. We've added a brand new id to our music product index. We are good to go. But look at this. In the web interface, I can get document 20 and update it from here. And I can set updated at to r.now(), which is the RethinkDB timestamp, and boom, id 20 goes into Elasticsearch, in our music index, under type product. Great, well, speaking of, let's head over to Elasticsearch, which is running on port 9200, and take a look at product 20.
And, yeah, I'm still storing all of the album information, but I don't really need all that in there. What I really want is just the name of the album, and I'll call that title here. And then what I also want is just the names of all the tracks, combined together and separated by commas. Those are the only things I really want to search on, and that makes much more sense. Okay, now I can just re-run this, and since I'm using the index method, it should just overwrite the product with the id of 20 inside of Elasticsearch. And it does. Look at that. That worked just right. Yeah, but we have a small problem: the title is undefined. But that's okay. Again, we're using index, and it should just roll right over it and replace this document. Okay, so there we go. Let's refresh, and, yep, good. Version five, that's right. And having a look here, that's great: the title is in there and the tracks are in there as well. Let's do one more. Why not? We'll go and update, let's say, product 25 and see what happens. We'll come over here and scan for it inside of Elasticsearch, and yeah, that has been changed as well. Well, great, let's go and update all the documents. I can do that just by dropping update on the entire table, and there it goes. All the documents have now been updated, and they are now synchronized completely into our Elasticsearch index. Now, I like this a little bit better. It's a little bit more manual; however, it is exactly the information I want to go in there, as opposed to the entire document. And in addition, I have control over when it goes in and gets integrated. To me, this is a perfect use case for change feeds: integrating with Elasticsearch and keeping a search index up to date.
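Pulling the pieces of this clip together, the whole sync ends up looking something like this. Treat it as a sketch: the field names (name for the album title and a songs array with a name on each track) are my assumptions about the catalog documents, so adjust them for your own schema:

```js
// sync.js: keep the Elasticsearch index in step with RethinkDB
var r = require('rethinkdb');
var elasticsearch = require('elasticsearch');

var es = new elasticsearch.Client({host: 'localhost:9200'});

// Index (or overwrite) just the bits we want to search on.
function syncSearch(product) {
  es.index({
    index: 'music',
    type: 'product',
    id: product.id,
    body: {
      title: product.name,
      tracks: (product.songs || []).map(function (s) { return s.name; }).join(', ')
    }
  }, function (err, response) {
    if (err) { return console.error(err); }
    console.log('indexed', response._id);
  });
}

r.connect({host: 'localhost', port: 28015, db: 'music'}, function (err, conn) {
  if (err) throw err;
  r.table('catalog').changes().run(conn, function (err, cursor) {
    if (err) throw err;
    cursor.each(function (err, change) {
      if (err) throw err;
      if (change.new_val) { syncSearch(change.new_val); }
    });
  });
});
```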
-
Geospatial Queries
RethinkDB supports geospatial queries right out of the box, which is quite interesting. For this, we can just do a really quick demo, and I can show you how to find the distance between two locations. It could be locations of offices, or maybe of your customers; that's up to you. The first thing I'll do is create a brand new table called geo, and then I'll insert some data into it. In here, I have San Francisco and San Diego, and both of those documents have gone in. Notice here that I have location set for both of these, and they're set with r.point. That's important, and I'm going to talk more about it in just a second. The first thing I'll do is grab the first document and get the location off of that document. This is San Francisco. If we take a look at this, look at that, it's a ReQL geometry type, and specifically a point. It gets set that way because I specified r.point and gave it a longitude and a latitude, in that order, and that is how it stores a point in the database. Alright, so, big deal. What can I do with this? Well, let's see. What I can do here is ask for the location and then use the distance function. The distance function is going to want a second location passed in, and for that I can use a query, using straight-up ReQL, and say go grab the location from the geo table for document two. That's San Diego, by the way. And, okay, when running this query, I get back an interesting number: 734,125. What is that? Well, taking a look here in the drop-down code completion window, the default unit is meters. I can override that if I want to, and I can specify km for kilometers. And if I re-run this, that looks a little bit more sane: 734 kilometers. If we want to see things in miles, I just enter mi, and I get 456. That's how many miles it is between San Francisco and San Diego, and that sounds about right. I grew up in LA; I used to drive to San Francisco all the time. Well, the demo I just showed you is up on RethinkDB.com. I thought it was impressive enough that I just wanted to show you how it worked. I could do a whole course on geospatial querying, but I just wanted to show you that they have it in RethinkDB. It's fascinating stuff, and it's quite fast. You can even create a secondary index using geospatial coordinates. There's a quick recap of the queries from this demo below, if you want to try them yourself. As for me, that's it. That's all I've got to show you. I really hope you've enjoyed this look at RethinkDB, and I hope you like it as much as I do. It's a lot of fun to play with, and I think it makes a really compelling case for using a NoSQL document storage system. Thanks for watching, and if you need to get a hold of me, my twitter address is on screen, and so is my email. I'll see you again soon.
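And, as promised, here's a quick recap of those geospatial queries for reference. The coordinates are approximate, and remember that r.point takes longitude first:

```js
// Two cities, stored as geospatial points (longitude, latitude).
r.table('geo').insert([
  {id: 1, name: 'San Francisco', location: r.point(-122.4194, 37.7749)},
  {id: 2, name: 'San Diego',     location: r.point(-117.1611, 32.7157)}
])

// Distance between them; the default unit is meters,
// and you can override it with km or mi.
r.table('geo').get(1)('location')
  .distance(r.table('geo').get(2)('location'), {unit: 'mi'})
```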