Can I use a column with a comma separated list of items in Cassandra?

As I'm working on Snap! C++, I have many places were I want to record a list of something.

Many times that list includes a reference to another column. Within the content table I use the links plugin to create references / connections.

So... Today I was enhancing the use of attachments to posts you make on a website. The actual files get saved in a table named files and I wanted to know who saved that file so I have a reference back to the content page. This will be useful when I want to file all the pages because, for example, the file is proven to be a virus.

If you're an SQL programmer, the first idea coming in mind, and probably the cleanest way to do this, is to create a connection table. This kind of table has IDs of the files and IDs of the pages so that way they are linked. This is how a relational database works.

Another "cheap" way of doing this, is to create one column in the files table named "references" and that column is a list of IDs (i.e. something like "1,5,123,4004".) This is practical because you avoid another table and for relatively small lists, it looks rather good.

Now... the table solution doesn't help much in Cassandra because it is not a relational database. Thus, many people may think: hey! Let's create one cell and add references, comma separated as shown in the "cheap" way of doing this.

This is an interesting concept and since the size of the content of a cell and the number of cells (columns, if you prefer, but these are really cells, not columns...) is "not limited" (2 billion each, it's virtually not limited if you consider that one website cannot have that many pages ever.)

However, using that second technique with Cassandra means using a read, modify, write cycle. This means it has to be done atomically (which SQL databases do and thus make them somewhat slow in that respect when accessed by many processes in parallel.) This is going to slow down everything because to be atomically correct you need to use a quorum and a lock...

To avoid that problem, and since each reference is anyway unique, you can instead make use of the fact that any number of cells can be created in a row. The name of the cell will include the reference (unless it is not safe/possible somehow, but Cassandra accepts binary keys, so there should be really no reason for that not to work.) So let's say you call that column "reference", and you have the references 1, 5, 123, and 4004, the result would be:

reference/1
reference/5
reference/123
reference/4004

And with that result, you can now add/remove references without much worries because each cell is distinct.

This is a different way of thinking for those like me who come from SQL. But I wanted to mention it somewhere as it is important to understand how things can work without the use of locks.

Snap! Websites
An Open Source CMS System in C++

Contact Us Directly