Tuesday, October 6, 2009

ORMs and Declarative Schemas

My prior post was more controversial than I anticipated. In hindsight, I should have realized what a hot button issue web frameworks are. One assertion went all but unnoticed. I expect it may be even more controversial: the schema-generative ORM paradigm is fundamentally flawed.

Disclaimer

I came to this conclusion while working with Django's ORM, but this post is completely object-relational mapper agnostic. We are now using SQLAlchemy, but we are not using any of the many available declarative layers. Instead, we are using schema reflection and semi-automatically configured mappers. This is not an argument against ORMs. It is an argument against generating database schemas from ORM declarations. By extension, this is an argument against Django's ORM because Django uses an exclusively schema-declarative model. That said, Django's ORM is far from alone in this camp.

Data Outlives Code

When code is dead and gone -- be it through rewrite, obsolescence, or by other means -- the data will still be there. Longevity implies slower evolution; data is always more difficult and riskier to change. Data is also more valuable. What if Facebook rebooted their database? They've already rebooted their software several times.

Schemas are data. As data, schemas are longer lived, less flexible, and more valuable than code. These factors alone suggest that the database itself should hold the authoritative schema, not a class declaration in the code.

If you have inherited data from another project, you already know this lesson. You can't generate the schema from code because the schema already exists. You can mimic the authoritative schema in your declarations, but it is easier and more accurate to use reflection.
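To make "reflection" concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not SQLAlchemy's reflection machinery; the message table and the reflect_columns helper are hypothetical names invented for illustration. The point is only that the column list is read from the live database rather than declared in code.

```python
import sqlite3

def reflect_columns(conn, table):
    """Return {column_name: declared_type} as reported by the live database."""
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute("PRAGMA table_info(%s)" % table).fetchall()
    return {row[1]: row[2] for row in rows}

# Hypothetical inherited schema; in practice it already exists in the database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (id INTEGER PRIMARY KEY, subject TEXT NOT NULL, body TEXT)"
)

columns = reflect_columns(conn, "message")
print(columns)
```

If someone later adds a column in the database shell, the next call to reflect_columns picks it up with no code change.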

Ineffective Domain Objects

Object relational mapping is primarily a serialization problem. Every serialization solution has its quirks. The scale and number of quirks seem directly proportional to the absolute difference between the runtime and storage representations. Since a database is a completely distinct type system, rather than an opaque byte array, serialization to a database can have particularly quirky quirks.

Modeling is one of the fundamental challenges of software development. Capable developers prefer highly expressive or unconstrained type systems to aid with modeling. Generally, runtime type systems are more expressive than those found in databases. Declarative relational mappers, however, constrain the programming language type system to its less expressive counterpart. When building domain objects, the developer must think in terms of the database's type system, not the programming language's.

While using a declarative object relational mapper, developers are effectively trying to design storage and runtime models simultaneously. On average, superior results are achieved by modeling these two concerns separately and then solving an additional subproblem: serialization mapping. You might wind up with slightly more code, but it will be easier to understand and maintain.
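A rough sketch of what modeling the two concerns separately might look like, assuming a hypothetical Message domain object and a hand-written pair of mapping functions (to_row and from_row are illustrative names, not part of any library). The domain object uses an enum and a date; the table stores plain text.

```python
import sqlite3
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class Status(Enum):
    DRAFT = "draft"
    SENT = "sent"

@dataclass
class Message:
    """Runtime model: free to use whatever types suit the domain."""
    subject: str
    status: Status
    sent_on: Optional[date]

def to_row(msg):
    """Serialization mapping: domain object -> storage tuple."""
    sent_on = msg.sent_on.isoformat() if msg.sent_on else None
    return (msg.subject, msg.status.value, sent_on)

def from_row(row):
    """Serialization mapping: storage tuple -> domain object."""
    subject, status, sent_on = row
    return Message(subject, Status(status),
                   date.fromisoformat(sent_on) if sent_on else None)

# Storage model: designed on its own terms, in the database's type system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (subject TEXT NOT NULL, status TEXT NOT NULL, sent_on TEXT)"
)

original = Message("Hello", Status.SENT, date(2009, 10, 6))
conn.execute("INSERT INTO message VALUES (?, ?, ?)", to_row(original))
restored = from_row(
    conn.execute("SELECT subject, status, sent_on FROM message").fetchone()
)
```

The mapping functions are the "slightly more code", but each of the three pieces can be read and changed on its own.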

SQL Is Not Going Away

Despite forcing a less expressive type system onto the developer, declarative ORM layers attempt to treat the database as an implementation detail. However much we wish this were true, the database is not a detail which can be ignored. Sooner or later, you are going to have to open your database shell and write a SQL expression. This requires knowledge of your database's particular SQL dialect and idiosyncrasies. Exacerbating this issue, generated schemas are typically full of name mangling and other ugliness. It is far more pleasant to work with a carefully designed schema than one that compromises for the ORM or the runtime type system.

Schema and Declarations Diverge

Data migrations present the most pressing need to work with SQL directly. Migration is a widely unsolved problem, so automation cannot be trusted. Unless you are painstakingly simulating the schema generator, the production database schema will slowly diverge from the declarations. Most deviations are tolerable as they will not affect the runtime behavior, but it is wise to minimize differences between production, staging, and development environments. To facilitate this, store migration scripts and backed-up schemas in version control.

Some deviations will directly affect runtime behavior. For example, consider the case where two versions of an application are running in production. An is_read boolean column was added to a message table in the database. Its default value is false. When a row's page is viewed, the is_read column is set to true. The old version of the application doesn't know about the column, so it cannot set the flag. When the new version is rolled out to everyone, affected users will see a bunch of read items marked as unread! The solution is to set the column's default to true, but initialize it to false in the application. Declarations must either deliberately deviate from the schema or present a misleading default value to be overridden during initialization. This is just a simple example; real-world schema migrations can be significantly more complex and suffer from numerous, more subtle problems.
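The is_read fix can be sketched with the standard sqlite3 module. The table layout and the two insert helpers are assumptions made up for this illustration; SQLite stores booleans as integers, so true is 1 and false is 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# The migration: the column defaults to true (1), so rows written by code
# that has never heard of is_read do not show up as unread later.
conn.execute("ALTER TABLE message ADD COLUMN is_read INTEGER NOT NULL DEFAULT 1")

def old_version_insert(conn, body):
    """Old application version: does not know the column exists."""
    conn.execute("INSERT INTO message (body) VALUES (?)", (body,))

def new_version_insert(conn, body):
    """New application version: explicitly initializes the flag to false."""
    conn.execute("INSERT INTO message (body, is_read) VALUES (?, 0)", (body,))

old_version_insert(conn, "from old")
new_version_insert(conn, "from new")

rows = conn.execute("SELECT body, is_read FROM message ORDER BY id").fetchall()
print(rows)
```

Note that the schema default (1) and the application's initial value (0) disagree on purpose, which is exactly the deviation a generated schema cannot express without a misleading declaration.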

Monday, August 24, 2009

URL routing and views

As promised, I'd like to elaborate on the URL routing system I came up with.

It weighs in at fewer than 200 lines of code (including an example), so I'll let it speak for itself: download it.

This approach seems to be working great for us. Love it? Hate it? Feel free to let me know what you think.

Wednesday, August 19, 2009

Dropping Django

My partner and I built a non-trivial web site on Django. When the next version ships, there might not be a single Django module imported.

We're not trying to drop Django; it is just sort of happening. Piece by piece, it is failing to meet our needs. Despite the marketing copy on the Django site, most components of the framework are tightly coupled enough to make customization frustrating. It is often easier to rewrite core framework components than to implement them on top of the existing extensibility points.

What follows is a loose chronology of our migration away from Django.

URL Routing

A flat list of patterns violates the DRY principle when creating nested URLs. Trees are a superior representation. Having a tree of views also enabled us to optionally associate a "binder" function with each node. These bind functions are executed for each URL component from left to right, filling the template context as they go. Breadcrumbs are automatically generated as each node binds, but only the last node executes its full view logic.
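The tree-with-binders idea can be sketched in a few lines. This is not our actual routing code; Node, dispatch, the "*" wildcard convention, and the example routes are all hypothetical and exist only to show binders running left to right while only the final node's view executes.

```python
class Node:
    """One URL component: a title for breadcrumbs, an optional binder, a view."""
    def __init__(self, title, binder=None, view=None):
        self.title = title
        self.binder = binder
        self.view = view
        self.children = {}

    def add(self, segment, node):
        self.children[segment] = node
        return node

def _bind(node, segment, url, context):
    if node.binder:
        node.binder(context, segment)  # may fill the context or raise
    context["breadcrumbs"].append((node.title, url))

def dispatch(root, path):
    """Walk the tree left to right, binding every node, viewing only the last."""
    context = {"breadcrumbs": []}
    node, url = root, "/"
    _bind(node, None, url, context)
    for segment in (s for s in path.split("/") if s):
        # "*" is an assumed convention for a child that matches any segment.
        child = node.children.get(segment) or node.children.get("*")
        if child is None:
            raise LookupError("no route for %r" % path)
        node = child
        url += segment + "/"
        _bind(node, segment, url, context)
    return node.view(context)

root = Node("Home", view=lambda ctx: ctx)
projects = root.add("projects", Node("Projects", view=lambda ctx: ctx))
projects.add("*", Node(
    "Project",
    binder=lambda ctx, segment: ctx.update(project=segment),
    view=lambda ctx: ctx,
))

result = dispatch(root, "/projects/apollo/")
```

Here the view simply returns the context so the effect of the binders is visible; a real view would render a template with it.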

Authorization

Our site enforces permissions on every resource, but Django's database ACLs would have been prohibitively numerous. Instead, views or their URL binders may raise an AccessDenied exception. Upon catching such an exception, a middleware layer serves a login form. This ensures users have permission to access the current resource, as well as all ancestor resources bound to the URL.
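A minimal sketch of the exception-plus-middleware pattern, written as plain WSGI so it does not depend on Django's middleware API. The names login_on_access_denied and secret_app, the 403 status, and the form markup are assumptions for illustration; the only piece taken from the description above is that views or binders raise AccessDenied and a wrapper turns that into a login form.

```python
class AccessDenied(Exception):
    """Raised by a view or URL binder when the user lacks permission."""

LOGIN_FORM = (
    b"<form method='post' action='/login'>"
    b"<input name='username'><input name='password' type='password'>"
    b"<input type='submit' value='Log in'></form>"
)

def login_on_access_denied(app):
    """Wrap a WSGI app; serve a login form whenever AccessDenied escapes it."""
    def middleware(environ, start_response):
        try:
            return app(environ, start_response)
        except AccessDenied:
            # Assumes the exception is raised before the app starts its response.
            start_response("403 Forbidden", [
                ("Content-Type", "text/html"),
                ("Content-Length", str(len(LOGIN_FORM))),
            ])
            return [LOGIN_FORM]
    return middleware

def secret_app(environ, start_response):
    """Hypothetical protected resource."""
    if environ.get("REMOTE_USER") != "admin":
        raise AccessDenied("admins only")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"secret"]

wrapped = login_on_access_denied(secret_app)
```

Because binders for ancestor nodes run before the final view, any one of them raising AccessDenied is enough to stop the request.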

Authentication

Both of Django authentication's key extensibility points are flawed. These two extensibility points are "user profiles" (storing additional per-user data) and custom credentials (such as for logging in via email address instead of username). Django's documentation and numerous internet sources cover both topics, but all of the guidance lacks important caveats. The admin UI, in particular, is very easy to break with either extensibility mechanism.

Extending the User model with the ORM requires a one-to-one database relationship. This relationship can be implemented with a "user profile" setting, an explicit foreign key, or model inheritance. Each approach has its own strengths and weaknesses in terms of performance, API semantics, subtle behavioral changes, and outright bugs.

Enabling custom credentials requires implementing a trivial authentication "backend" object. Unfortunately, it is non-trivial to replace usernames with email addresses. The admin UI's login form refuses to accept email addresses without hacking the template. Even if you hacked the template, the User model would still enforce a non-null constraint on the username field and the generated database schema enforces a uniqueness constraint as well. It turns out to be easier to fill the username field with a dummy value and "support" both forms of authentication with your backend, but you won't come to that conclusion until your head has already bored a hole in your desk.

Templating

We do our best to keep view and template logic separate. Django's templates are targeted at designers, who aren't implementing any real logic anyway. However, we're a pair of hackers. Sometimes it is just more convenient to put a little bit of logic in the views. Besides, templates are code; code needs to be reviewed and tested. We wouldn't ever hire a designer who couldn't pass a code review for some trivial template logic.

We needed a pragmatic template language to replace Django's idealistic one. Any template language with greater expressive power would have been welcome, but Jinja2 fit the requirements and provided the easiest migration path. Ultimately, we'd prefer to use something like HAML, but there doesn't seem to be a Python equivalent besides the inactive GHRML project. We are, however, using SASS. I will never write CSS by hand again.

ORM and Admin UI

One of Django's most touted features is the Admin UI. For simple "active record" style database models, the Admin UI is a huge time saver. Sadly, it struggles a little bit with nullable fields and is tricky to customize. You'll definitely need to write custom UI for complex models, but by and large the admin solves the problem at hand: viewing, creating, updating, and deleting database rows.

After using the Admin for a little while, I found myself missing Microsoft Access. I never thought I'd say that, but it is true. Django's admin does not support sorting, filtering, or other impromptu queries. Edit: It turns out I was mistaken about sorting and filtering, but I stand by the core message of this section. I found myself writing impromptu queries in the database and Python shells. After a while, I just gave up and installed a desktop client. I haven't visited the Admin UI since.

Django's ORM has shortcomings with respect to querying, especially for joins and aggregation. It has been improving over time, but it will likely never reach the capability of projects solely focused on databases, such as SQLAlchemy. With the admin having fallen into disuse, the Django ORM lost all advantage. Beyond Django's specific weaknesses, I've come to believe that the schema-generative ORM paradigm is fundamentally flawed. That is a topic that deserves an entire (Django-agnostic) post of its own. We are now using SQLAlchemy via schema reflection; no declarative layer.

Form Validation and Generation

Here is where our chronology meets present day. We are still using Django form validation, but never used form generation beyond scaffolding. Nearly all of our templates customize labels and display of errors. Additionally, embedding widget information in the Python code is cumbersome during template development. Django forms is a quality validation library, but there are some inconsequential style things that I like better about FormEncode. Preferences aside, the difference isn't large enough to justify switching.

While I like FormEncode, I'm still not sold on its anti-form-generation companion, htmlfill. I think there is a middle ground with form generation that provides scaffolding during development, smoothly transitions to production use, and cooperates with validation. As we implement more complex client views, I'll be on the lookout for ways to improve our form development toolbox.

So, ugh... What's left?

Besides a few isolated helper functions, not much is left of Django.

The last big ticket item is the HTTP framework and WSGI server. We could continue using Django as if it were CherryPy or Paste, but Django has this nasty habit of insisting on running your code for you. The settings and manage.py infrastructure are fiddly for deployment and don't really add any value over simple scripts using our application like a library. Might as well use a simpler WSGI library, and replace those over-engineered management/commands/foo.py files with vanilla scripts/foo.py files.

Moral of the Story

I'm sure there are numerous lessons to be generalized from this journey. Personally, I've developed a moderate fear of the word "framework", as well as altered the way I think about software abstractions. I think the most important lesson, however, is one I already knew: choose the right tool for the job. Unfortunately, we had no idea what the right tool was when we started. I'm not sure we know any better now.

Friday, June 12, 2009

AppWeek

Shawn's AppWeek post inspired me to write one too. AppWeek is our chance to be creators for a little while and it was a lot of fun. I didn't set out to build something nearly as ambitious as "Super Avatar Sample Smashup EXTREME! - 'Capture the Cat' edition", but I did get to take a swing at a game I've wanted to build for a while: Rock'em Sock'em Avatars Avatar Boxing. Avatars, being a new feature in this release, were an unwritten requirement for all of the AppWeek games. Between SASSECTCE, my game, and the many others, Avatars were chasing cats, beating each other up, playing futuristic sports, falling off buildings, dancing in a cloud of gems, being launched from cannons to save the world, and much more. All this excitement was almost too much for a bunch of exhausted engineers, but that's what the beer was for during the game unveilings.

Here's what the game looked like with the basic animations wired up. You'll notice that the avatars have been hitting the gym. That's because their arms were too short to reach each other! I added a little extra bulk because I was laughing too hard not to. I directly bound the game pad triggers to the shoulder and elbow joints and rigged up the chase camera sample to inspect my work. There wasn't much game play yet, but it was already fun. That's always a good sign.

bdbx1-1

bdbx1-2

Even with just one week, I decided to invest some time into debugging visualizations. That turned out to be a really great idea.

bdbx1-5

Then, I added some collision spheres for the heads, hands, and upper bodies. This was a hacky, trial and error process. Thankfully, C# compiles quickly.

bdbx1-4 

bdbx1-3

At this point, I spent an entire day working on the physics. I wanted the avatars to bounce/wobble when they got hit, so I rigged up some complex spring systems. Things were starting to work, but I'm generally pretty bad at this sort of thing and my simulation routinely exploded. The avatars' arms went shooting off into space and I was getting pretty frustrated. No screen shots of that chaos because I am embarrassed.

With half a day to go, I added the obligatory damage bars and some rudimentary hand-to-head collision detection.

bdbx1-6

I was feeling pretty good about the game. Despite my physics failures, it was pretty fun anyway. I wandered down the hall to chat with Jace, who had just added sound effects to his game. His game was hilarious before, but the sound effects were priceless. I ejected the sound effect CD out of his machine, yoinked it, and took off running. An hour later (and 10 minutes after the deadline), my game had some sweet punch and miss sounds. I also made the avatars' heads pop up when their damage bar was full, accompanied by an awesome zip-tie sound.

bdbx1-7

At our team happy hour, I'd like to think Avatar Boxing was a fan favorite. I certainly had fun making it! I hope everyone enjoys Avatar support in the new XNA Game Studio.

Monday, April 20, 2009

PowerShell: condemned to reinvent

I tried PowerShell when it was first released, but never used it for real work. I recently attended a "brown bag" presentation about PowerShell. This presentation spurred me to augment our team's environment with PowerShell and I have been using it every day since.

In the past weeks using and abusing PowerShell, I have drawn two conclusions:

  1. PowerShell has a killer set of standard tools with brilliantly designed usability.
  2. The PowerShell team doesn't understand UNIX and was therefore condemned to reinvent it, poorly -- with apologies to Henry Spencer.

First things first: if you spend any time working with Windows, get PowerShell. Now. Stop reading my blog and go download it immediately. It mops the floor with cmd.

The key premise behind PowerShell is that it operates on live .NET objects. This is beneficial because it eliminates a lot of the text cutting and manipulation common in shell scripts. Additionally, it puts the full .NET Base Class Library into your scripting toolbox. PowerShell tools, known as commandlets, typically only render the most common fields for their objects, but the less common fields are easily available in memory. By convention, commandlets are named with a verb-noun pattern and support a common command line parsing behavior. The repository of commandlets and the command line options of each are easily queried and highly consistent. All this metadata makes PowerShell a breeze to learn.

I fell in love with the discoverability and ease of use when I tried to kill a collection of runaway processes:

PS> get-command -noun process

CommandType     Name            Definition
-----------     ----            ----------
Cmdlet          Get-Process     Get-Process [[-Name] <String[]>] [-Verbo...
Cmdlet          Stop-Process    Stop-Process [-Id] <Int32[]> [-PassThru]...

PS> get-process notepad | stop-process
PS> get-alias | where { $_.definition.contains("Process") }

CommandType     Name            Definition
-----------     ----            ----------
Alias           kill            Stop-Process
Alias           ps              Get-Process

PS> ps someotherapp | kill

OK, that's pretty cool and oh-so-very Unixy -- right? Wrong. Notice the "CommandType" column in the results of get-command. There are many other types of commands besides commandlets: functions, filters, scripts, applications, etc. Each of these has slightly different semantics for pipes and parameters. Applications, for example, have no way of accepting .NET object pipes. You must develop a separate commandlet. Yikes!

Compare to Unix: all commands are applications which accept a command line and pipe byte streams in and out. Much simpler, but byte streams aren't as friendly, discoverable, and maintainable as object streams. However, the brilliantly simple thing about Unix is that, when you get right down to it, object streams are just byte streams! There is absolutely nothing stopping you from implementing get-process and stop-process as Unix programs which pipe object references, JSON, pickled Python objects, XML, S-expressions, or any other data format you fancy. Doug McIlroy, the inventor of Unix pipes, was right: text streams are the universal interface.
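To illustrate object streams riding on byte streams, here is a small sketch in which each stage reads and writes one JSON object per line. Everything here is hypothetical: emit_processes stands in for a get-process style program with made-up data, and the stages are wired together with in-memory buffers so the example is self-contained. As real programs, each stage would read sys.stdin and write sys.stdout instead.

```python
import io
import json

def emit_processes(out):
    """Stand-in for a get-process program: one JSON object per line."""
    fake_table = (
        {"pid": 101, "name": "notepad"},
        {"pid": 102, "name": "calc"},
        {"pid": 103, "name": "notepad"},
    )
    for proc in fake_table:
        out.write(json.dumps(proc) + "\n")

def where_name(name, inp, out):
    """Filter stage: pass through only the objects whose name matches."""
    for line in inp:
        obj = json.loads(line)
        if obj["name"] == name:
            out.write(json.dumps(obj) + "\n")

def collect_pids(inp):
    """Final stage: a stop-process program would act on these ids."""
    return [json.loads(line)["pid"] for line in inp]

# Wire the stages together; each StringIO plays the role of a pipe.
pipe1 = io.StringIO()
emit_processes(pipe1)
pipe1.seek(0)

pipe2 = io.StringIO()
where_name("notepad", pipe1, pipe2)
pipe2.seek(0)

pids = collect_pids(pipe2)
print(pids)
```

Fields a stage does not care about simply pass through untouched, which is most of what makes object pipes pleasant.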

Actually, this is no different on Windows. All of the PowerShell commandlets could have been implemented as applications which import a library. This library would replace main in much the same way as winmain, provide a metadata enriched implementation of getopt, man, etc. There is no need to invent a new shell in order to acquire the power of piping objects. Sure, cmd is old and needed to be retired for many other reasons, but it is a real shame that the PowerShell toolset is not available to those of us stuck in batch scripts.

Personally, I would really like to see such a library developed. Microsoft has certainly proved one thing with PowerShell: steep learning curves are not intrinsic to command line interfaces. Unfortunately, commandlets are two steps forward and one step backwards. I have no doubt that we can retake that forward step.

Monday, February 9, 2009

Language-oriented programming: too much, too fast

There was some recent discussion on Hacker News about the 2004 article on "Language Oriented Programming" by Sergey Dmitriev. With my growing interest in programming languages, this article, and the pending release of JetBrains' Meta Programming System, I have been thinking a lot about the future of programming.

As Sergey points out, MPS is not the first entry into this paradigm of software development. Intentional Programming was being demonstrated by Microsoft Research as early as 2000. Wikipedia lists several implementations of the language-oriented programming concept. (OK, that is enough links for now!) Sadly, none of these systems has met with widespread success. Despite my unlimited respect for the JetBrains team and love of their products -- especially ReSharper -- I expect MPS to fail to achieve critical mass. I don't think many programmers would disagree. It's just too much, too fast.

Software development needs to evolve, not start anew. If you change too much at once, the common developer simply won't accept it. Language designers are still re-inventing Lisp piece by piece in an effort to make sense of all the different concepts for Mort. In order to improve adoption of language-oriented programming concepts, the critical path must be identified and the programming community must be led down that path one step at a time.

I believe that the first stepping stone is the box editor.

Both MPS and Intentional Programming provide a visual tree-of-boxes editor and store the abstract syntax tree in a sort of source code database. Both of these concepts are critical to the power of language-oriented programming, but changing the storage of source code is simply too radical of a change. Well, actually, source code is routinely stored in databases for IDE features, but the authoritative storage always remains text files. All of a programmer's most trusted tools operate at the string level and it is simply impractical to throw everything out at once.

This is why I propose the creation of a new editor which uses an abstract syntax tree box editing model, but preserves the source as text. This editor would have to play nice with other developers who are using traditional text editors. It should be possible for a single developer to try the editor for a day or two without another collaborator even noticing. There are clear benefits to a box editor even in the absence of language-oriented features. For example, source code could be navigated by parent, child, or sibling relationships instead of by word or by line. Advanced renderers could be used to lay out math expressions or to show a referenced image file in a comment.
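As a rough illustration of tree navigation over source that stays plain text, here is a sketch built on Python's standard ast module. It is only the navigation half of the idea, not an editor; set_parents and next_sibling are hypothetical helpers, and the parent attribute is something this sketch attaches itself (the ast module does not provide it).

```python
import ast

# The authoritative source remains ordinary text.
source = "def area(w, h):\n    total = w * h\n    return total\n"
tree = ast.parse(source)

def set_parents(root):
    """Attach a .parent attribute to every node so we can move upward."""
    for parent in ast.walk(root):
        for child in ast.iter_child_nodes(parent):
            child.parent = parent

def next_sibling(node):
    """Move to the following sibling under the same parent, if any."""
    siblings = list(ast.iter_child_nodes(node.parent))
    index = siblings.index(node)
    return siblings[index + 1] if index + 1 < len(siblings) else None

set_parents(tree)

func = tree.body[0]          # the "def area" box
assign = func.body[0]        # child: "total = w * h"
following = next_sibling(assign)

# Every box maps back to a span of the original text.
print(ast.get_source_segment(source, following))
```

Because each node carries line and column offsets, an edit to one box can be written back as an edit to the text, which is what would let collaborators keep using their usual editors.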

Editors are a religion, so start a new one. If some box editor becomes popular enough, other tools would spring up around it. To help the process, the editor should have well exposed extension and composition mechanisms. For example, it should be trivially easy to add new renderers for various nodes. It should be equally trivial to utilize the abstract syntax tree system to create standalone code analysis tools.

If box editors garner enough of a following, then the language-oriented programming advocates can seek the next stepping stone.

Tuesday, February 3, 2009

Break the cycle of broken builds

Broken builds suck. Breaking the build sucks. No one wants to have their work stopped by others' build breaks and no one wants to cause work stoppage on account of their own build breaks.

So then why do builds break so often? You'd think that after so many years of software development, someone would have solved this problem. Microsoft employs tens of thousands of engineers across hundreds, if not thousands, of code projects. Many of those code projects are development tools for the engineering community at large as well as engineers within Microsoft. Yet, still, all of the tools at the disposal of those engineers are highly prone to build breaks.

Below are three ideas for reducing build breaks. Together, they could all but eliminate the problem.

Use a better version control system

Everyone keeps raving about distributed version control, and for good reason. I won't rehash the argument here, but I will say: branch early and merge often. Perforce (better known as Source Depot within Microsoft) and its spiritual successor Team Foundation Server are completely unbearable. A great majority of build breaks can be contained if people are only pulling known good changes from dedicated integrators. Everyone checking in all at once to a main branch simply doesn't work with anything but the most tightly knit of teams. Even if the build does break, it should only block the person waiting on that particular change.

Isolate build breaks

`sd submit -c 12345` translates roughly to "I'm 100% confident that my changes are solid and am equally confident that I am submitting the changes that I think I am." You can never be 100% confident, so don't take any chances: submit to a private branch which is being monitored by a build service. The server should detect your submission, run a full buddy build (with unit tests!), and then submit it to the shared branch only after it has passed.

If your builds take too long for this, people should be able to grab changes down from your unverified branch on an as-needed basis. If this happens too frequently: your builds take too long. Refactor your system into smaller components and formalize the contractual interface between those components. Then buddy build just the components which have changed. Verify that the interface hasn't changed with a unit test. This will stop compile errors at the boundaries.
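One way to verify that an interface hasn't changed with a unit test is to pin down signatures and compare them against what the component currently exposes. This is only a sketch in Python (the same idea applies in statically compiled languages via reflection); render_invoice, EXPECTED_INTERFACE, and interface_changes are hypothetical names standing in for a real component boundary.

```python
import inspect

# Hypothetical function exported by one component and called by another.
def render_invoice(order_id, currency="USD"):
    return "invoice %s (%s)" % (order_id, currency)

# The formalized contract, checked in alongside the component's tests.
EXPECTED_INTERFACE = {
    "render_invoice": "(order_id, currency='USD')",
}

def interface_changes(namespace, expected):
    """Return (name, expected_signature, actual_signature) for each mismatch."""
    changes = []
    for name, signature in expected.items():
        func = namespace.get(name)
        actual = str(inspect.signature(func)) if callable(func) else None
        if actual != signature:
            changes.append((name, signature, actual))
    return changes

# In a unit test this would be an assertion that the list is empty.
changes = interface_changes({"render_invoice": render_invoice}, EXPECTED_INTERFACE)
print(changes)
```

A failing check points at the boundary that moved, so only the components on either side of it need a buddy build.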

Go lazy, be late

This one is a bit of a pipe dream... software systems developed with dynamic languages do not experience build breaks nearly as often as those developed with static languages. This is because dynamic languages are typically lazy-loaded and utilize late-bound method invocations. If someone adds a new button to the UI which calls a method in a file they forgot to add to source control, that should only block people who want to click the button!

There is no firm technical reason why statically or JIT compiled systems can't simulate this. Most compile errors could theoretically be treated as a runtime error. Clearly, compile errors are valuable, but should be ignorable whenever possible. Don't ignore them in your known good builds, but ignore them when they stand between you and forward progress.

Since most of us aren't writing our own compilers or full tool chains, here is some more practical advice for projects using statically compiled languages: write code which fails gracefully. Runtime errors can be build breaks just the same as compile errors, but try to defer the negative consequences as much as possible. One broken feature shouldn't break the whole project; unless, of course, you are about to ship. In production builds, run verification code at startup or in unit tests, but try to be lazy and late in development builds.
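The forgotten-file button scenario above can be sketched with a lazily resolved handler. This is an illustration in Python, where the laziness comes for free; lazy_handler, click, and the module name exporter_not_checked_in are all made up for the example. The missing module only matters at the moment someone clicks the button that needs it.

```python
import importlib

def lazy_handler(module_name, func_name):
    """Build a handler that imports its implementation only when invoked."""
    def handler(*args, **kwargs):
        module = importlib.import_module(module_name)  # deferred until first use
        return getattr(module, func_name)(*args, **kwargs)
    return handler

buttons = {
    "sqrt": lazy_handler("math", "sqrt"),
    # Hypothetical module someone forgot to add to source control.
    "export": lazy_handler("exporter_not_checked_in", "run"),
}

def click(name, *args):
    """Fail gracefully: a broken feature reports itself instead of crashing."""
    try:
        return buttons[name](*args)
    except ImportError as exc:
        return "feature %r unavailable: %s" % (name, exc)

working = click("sqrt", 9.0)
broken = click("export")
```

For a production build, the same table can be walked at startup or in a unit test, calling importlib.import_module eagerly for every entry so the break surfaces before shipping.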