The big project I worked on over the summer is called MiBox. Those of you who know me know that I love Dropbox and use it for almost everything. I use it so much that I subscribe to Dropbox Pro for the 50GB of storage, where I keep my full iTunes library and all my photos. That said, Dropbox has a huge weakness: its server admins have the ability to view your files, even though the files are stored encrypted, because Dropbox holds the encryption keys. The reason for this policy is that it lets Dropbox cut down on storage costs. Every Dropbox upload is hashed globally, meaning that if I upload a 700MB video and someone else uploads the same one, it is only stored once. In order to hash the file, they obviously need access to the unencrypted data. They talk about this policy here, and assure everyone that there are plenty of in-house restrictions to prevent employees from accessing anything. This lets me sleep relatively well at night, but I still encrypt my most confidential documents myself.

All of this being said, I decided to take a shot at creating my own version of Dropbox with true client-side encryption. This means that nothing ever leaves your computer without being encrypted with a key that the server doesn’t know anything about. I decided to call it MiBox, although the name is pretty pointless. I decided to build it on Amazon Web Services for all storage, since I am not in the web hosting business, nor do I want to be. I used S3 for file storage and SimpleDB for metadata. S3 has metadata capabilities of its own, so I replicate the metadata there, but I still needed SimpleDB for querying by date.

I developed this iteratively, since it was by far the biggest project I’ve attempted on my own. I started out simply getting the program to do a full sync between client and server, meaning that it got a list of all files on the server and all files on the client, and synchronized them appropriately. This process alone was very hard to visualize, and I had to work it out on paper for many hours. You know you’re working on a tough project when you can’t even start coding without writing a bunch of ideas out on paper. This is the result of just planning the sync process:
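The heart of that full sync is a per-path comparison between the two file lists. Here is a minimal sketch of the decision logic, assuming path-to-hash maps for each side (the names are mine, and the real logic also has to consult last-sync metadata to tell a new file from a deletion):

```java
import java.util.*;

// Sketch of the full-sync comparison: given path -> content-hash maps for the
// local and remote sides, decide what to do with each path. A real sync also
// needs last-sync state to distinguish "new here" from "deleted there".
public class FullSync {
    public enum Action { UPLOAD, DOWNLOAD, NOTHING, CONFLICT }

    public static Map<String, Action> plan(Map<String, String> local,
                                           Map<String, String> remote) {
        Map<String, Action> plan = new TreeMap<>();
        Set<String> all = new TreeSet<>(local.keySet());
        all.addAll(remote.keySet());
        for (String path : all) {
            String l = local.get(path), r = remote.get(path);
            if (r == null)        plan.put(path, Action.UPLOAD);   // only local
            else if (l == null)   plan.put(path, Action.DOWNLOAD); // only remote
            else if (l.equals(r)) plan.put(path, Action.NOTHING);  // in sync
            else                  plan.put(path, Action.CONFLICT); // both differ
        }
        return plan;
    }
}
```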

Anyway, I got the syncing done in about two days’ worth of work. The next iterative step was partial syncing, meaning that the program compares only the files on the server and client that have changed since the last sync, and then syncs them appropriately. This introduced a ton of edge cases which my metadata model did not handle, so I had to add a lot of columns. While reworking everything, I took the opportunity to modularize the actions needed to run the synchronization steps for the various situations (e.g. local file changed, conflict, remote file deleted), thinking ahead to when things would be multi-threaded.
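To illustrate that modularization, here is one way the per-situation actions might look as objects (the interface and names are my own illustration, not MiBox’s actual code):

```java
// Sketch of modularized sync actions: each situation (local change, conflict,
// remote deletion, ...) becomes its own self-contained action object that can
// later be handed to worker threads. All names here are illustrative.
public class SyncActions {
    public interface SyncAction extends Runnable {
        String describe();
    }

    // Picks the right action for one file based on what changed on each side.
    public static SyncAction forState(boolean localChanged, boolean remoteChanged,
                                      String path) {
        if (localChanged && remoteChanged) return action("conflict: " + path);
        if (localChanged)                  return action("upload: " + path);
        if (remoteChanged)                 return action("download: " + path);
        return action("no-op: " + path);
    }

    private static SyncAction action(final String desc) {
        return new SyncAction() {
            public String describe() { return desc; }
            public void run() { /* transfer + metadata updates go here */ }
        };
    }
}
```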

After getting this working, I actually tested it with real data, and I found that the upload times were horrible. It turns out that the AWS SDK provided by Amazon is not very high-performance when it comes to transfers. I did some research and decided to switch all S3 operations to use the JetS3t library, which had the added benefits of built-in support for hashing (which I had done manually up until this point) and encryption (which I hadn’t yet tackled). Converting all my S3 calls to JetS3t calls took surprisingly little effort, and the transfer speeds were immediately much better. It’s worth noting that somewhere in these iterations I also implemented versioning in a rather simple way: file data is stored in a separate bucket using the hash as the key, and the actual file hierarchy refers to the hash. The database also has a versions table which stores the metadata for each uploaded file. I have not, however, added the ability to actually roll back to previous versions, but it shouldn’t be too hard.
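The hash-as-key scheme is content-addressed storage. A small sketch of the idea (I’m assuming SHA-256 here; the post doesn’t pin down which hash is actually used):

```java
import java.security.MessageDigest;

// Sketch of the content-addressed layout: the data bucket is keyed by the
// file's hash, and the hierarchy bucket maps each path to that hash, so
// identical files are stored (and uploaded) only once.
// I'm assuming SHA-256; any stable content hash works the same way.
public class ContentAddress {
    public static String hashKey(byte[] fileData) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(fileData)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
// hierarchy bucket:  "photos/cat.jpg" -> "ba7816bf..."  (path -> hash)
// data bucket:       "ba7816bf..."    -> <file bytes>   (hash -> data)
```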

Since I was now using JetS3t, my next step was to get encryption working. Thanks to JetS3t and the Bouncy Castle add-on, it required only minimal changes at the points where I download and upload files.
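For a feel of what client-side encryption involves, here is a plain `javax.crypto` sketch of the idea (this is not JetS3t’s actual API): derive a key from a passphrase the server never sees, and run every file through AES before it leaves the machine.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Not JetS3t's actual API -- a plain javax.crypto sketch of client-side
// encryption: the key is derived from a passphrase that never leaves the
// machine, so the server only ever sees ciphertext.
public class ClientSideCrypto {
    private static byte[] crypt(int mode, byte[] data, char[] passphrase,
                                byte[] salt, byte[] iv) throws Exception {
        SecretKeyFactory factory =
                SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] key = factory.generateSecret(
                new PBEKeySpec(passphrase, salt, 10000, 128)).getEncoded();
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(data);
    }

    public static byte[] encrypt(byte[] d, char[] pw, byte[] salt, byte[] iv)
            throws Exception {
        return crypt(Cipher.ENCRYPT_MODE, d, pw, salt, iv);
    }

    public static byte[] decrypt(byte[] d, char[] pw, byte[] salt, byte[] iv)
            throws Exception {
        return crypt(Cipher.DECRYPT_MODE, d, pw, salt, iv);
    }
}
```

In the real program JetS3t handles this transparently on upload and download; a real deployment would also use a fresh random salt and IV per file rather than fixed ones.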

After these steps, here is the status of the program. You provide the AWS credentials and a few other settings (the local path to the directory you want to be the root of your MiBox, the names of the buckets to use on your S3 account, etc.) in a configuration file. When you run the program, it performs a partial sync (which on the first run amounts to a full sync). This establishes the local SQLite database, the remote SimpleDB database, and all the necessary S3 buckets, as well as transferring the files. You can do this on multiple computers, and it properly handles conflicts (I’ve tested this by running multiple instances of MiBox in VirtualBox).
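A configuration file like that could be plain Java properties; the sketch below loads one, with key names that are my guesses rather than MiBox’s actual ones.

```java
import java.io.StringReader;
import java.util.Properties;

// Sketch of reading MiBox-style settings from a Java properties file.
// The key names shown in the example are hypothetical.
public class MiBoxConfig {
    public static Properties load(String text) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }
}
// Example mibox.properties (hypothetical keys):
//   aws.accessKey      = AKIA...
//   aws.secretKey      = ...
//   local.root         = /home/me/MiBox
//   s3.hierarchyBucket = mibox-files
//   s3.dataBucket      = mibox-data
```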

What it does not yet do is monitor the directory in real time. I’m thinking this will be a relatively easy addition thanks to JNotify, but it will introduce two new required features. First, I will need to implement a thread executor which maintains a hash map of all files currently being synced. This is to handle the case where someone modifies a 10MB file, it starts syncing, and then they make another change and save again. There’s no sense in finishing the sync if there is a newer pending action on the same file. I have this process planned out, and I believe it will work pretty well. For this to work, however, the second required component needs to be taken care of: proper error handling within the sync actions. This is needed so that I can safely interrupt a sync action (uploading a changed file, downloading a change, deleting a file, etc.) without leaving the remote or local metadata in an inconsistent state. This will be even more important when you consider multiple computers modifying things at the same time. None of the sync operations are atomic, because each involves multiple service calls: modifying SimpleDB, the local SQLite database, and so on. They will have to be made atomic by making them behave like transactions, which will require custom handling of rollbacks and the like. This is no small undertaking, and I haven’t fully planned out what it will look like.
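The planned executor might look something like this sketch: one in-flight task per path, where a newer change cancels and replaces any sync already running for that file (all names here are my own illustration of the plan above).

```java
import java.util.Map;
import java.util.concurrent.*;

// Sketch of the planned thread executor: at most one in-flight sync per path,
// and a newer change cancels any sync already running for that file.
// Class and method names are illustrative, not MiBox's actual code.
public class SyncScheduler {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Map<String, Future<?>> inFlight = new ConcurrentHashMap<>();

    public synchronized Future<?> submit(String path, Runnable syncAction) {
        Future<?> previous = inFlight.get(path);
        if (previous != null) previous.cancel(true); // newer action supersedes it
        Future<?> next = pool.submit(syncAction);
        inFlight.put(path, next);
        return next;
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

Cancelling mid-transfer is exactly why the sync actions need the transactional error handling described above: an interrupted action must roll back whatever metadata it has already touched.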

Adding real-time syncing in the opposite direction will be a lot easier, actually. I can just query the SimpleDB database for any records modified since the last sync, and if anything changed it can kick off a partial sync on those files. This would have to work in the context of the Thread Executor and transactional model above, so it can’t be implemented until those are finished.
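That “records modified since the last sync” check maps naturally onto a SimpleDB select expression; this sketch only builds the query string (the domain and attribute names are my guesses), which would then be handed to the AWS SDK’s SimpleDB select call.

```java
// Builds a SimpleDB select expression for "changed since the last sync".
// SimpleDB compares attribute values lexicographically, so ISO-8601 timestamps
// sort in chronological order. Domain and attribute names are hypothetical.
public class ModifiedSince {
    public static String query(String domain, String isoTimestamp) {
        return "select * from `" + domain + "` where lastModified > '"
                + isoTimestamp + "'";
    }
}
```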

Because those new features are going to be a ton of work, I don’t know when or if I will be adding them. My real concern is that even if I get this working perfectly, it’s still going to be missing a ton of features that I use all the time from Dropbox, such as the web interface, the iPhone app, and more. That said, I am happy with how far the project has come already, and the fact that I basically did implement the core features of partial syncing. Creating the architecture of how everything would work was a great learning experience, and it gave me a serious respect for AWS. I know a lot of people are not too happy with SimpleDB, but I found it to be pretty powerful if you model things right.

To summarize the architectural components: the program uses ORMLite to maintain a local SQLite database, JetS3t for all S3 operations, and Amazon’s own SDK for all SimpleDB operations. The JetS3t encryption is powered by Bouncy Castle. After syncing, the local SQLite database has the same information as the remote SimpleDB database: basically the metadata for each file (last-modified date, hash, last sync date, etc.) needed for synchronization. The files are stored in S3 using two buckets. The first contains the file hierarchy (the folders and file names), and each entry maps to a file hash. The second bucket is just a bunch of file hashes mapping to the actual file data. This hash-based storage both saves bandwidth (identical files are not uploaded twice) and enables the versioning mentioned above. All of this would probably make for a really nice diagram, but my patience for Visio is running thin thanks to too much diagramming for my Senior Design project.

I wrote all of this in Java. You can download the source package here. It uses Maven for dependency management, so it should be relatively easy to compile. Actually running it requires an Amazon AWS account and placing your credentials in the configuration file. I will answer basic questions over e-mail, but I am not going to provide full support for this program, since it’s very much a work in progress and not a complete product.

The github repository for this project is here.