Many years ago (I will not reveal my age), I began working on my PhD thesis concerning the area of Domain-Specific Languages (DSLs). Research was booming at the time and many research articles stated in their introduction that DSLs are very useful and increase productivity, by reducing lines of code etc. All these claims seemed logical to me, but I always considered them something like urban legends. We all know that they are correct, but cannot easily prove it. Keeping that on the back of my mind, I searched for a way to bring the “legend” down to measurable facts that will provide solid motivation for the importance DSLs in every day programming. I decided to do a simple experiment that measures DSL usage in open source programs. My attack vector was simple; I narrowed down my data set to Java programs and measured package usage of application libraries that implemented DSLs. I selected Java for two reasons; first, it was easier to find and mine open source projects semi-automatically with a little scripting (python) by using the maven repository and second, the Java Standard Development Kit includes many application libraries that enable DSL support. At first, a snapshot (January 2012) of the Maven repository was downloaded locally. The repository contained various projects and their versions, which usually indicate a major release. All versions were filtered out and only the most recent was kept. The final project count included 12,959 projects, and more than 110 million lines of code. The table above shows various size metrics for the selected corpus. Note that Lines of Code (LoC) metrics include blanks. Source Lines of Code (SLoC) and Comment Lines of Code (CLoC) do not include blank lines. This is the reason that LoC != SLoC + CLoC. A set of standard DSL application libraries were identified and the source was scanned for specific import statements e.g. java.util.regex, which indicated that the standard package that implements regular expressions was used, thus regular expressions were used. If a package is detected in the source code, then the project will be tagged as using one DSL. So, if a project has as DSL count four, then it means that four different application libraries were detected during the source code scan.
The initial goal of this experiment was simply to provide quantifiable results that are indicative regarding the usage of each DSL in Java; thus only files containing Java code were scanned and accounted. Build files or other resources that contain other DSLs e.g. Apache ant, were not included. One final assumption was also made; if XML Path (XPATH) or Extensible Stylesheet Language and Transformations (XSLT) were found in the source code, then the project would be marked as using also XML. This is logical, since those two languages are used for query and transformations on XML Document Object Model (DOM) trees. The results were not surprising and the legends seem to be true. More than one-third of the projects (4,655) are using one or more DSLs. The most popular DSL is XML, with a percentage of 75% usage in Java projects. Regular expressions follow with a 25%. Another interesting aspect of the results, refers to the most popular DSL usage combinations. As expected, XML rule in the DSL popularity contest, followed by regular expressions and SQL. The results of this experiment show that DSLs are commonly used and that they are very popular in the Java software ecosystem. Note that this experiment was not exhaustive. If more than 20 application libraries were detected and peripheral files were taken into account, the DSL usage would have been higher. Now I have my motivation, DSLs are really popular and with modern approaches, like Scala, SugarJ etc, it seems they will be more popular in the future.
I like either confirming or debunking urban legends.
My definition of a DSL is much more stringent, in which “domain” is much higher level than “XML” but has instead the semantics of a specific application domain. I agree that the boundary is fuzzy. A graphics language like OpenGL or WebGL is not as high level as a drafting language with primitives like draw_hidden_line or display_elevation. XML is not as high level as VoiceXML, nor as high-level as the tagging of the New Testament that one student and I did.
My test for a DSL is whether a person working in the domain area already knows the vocabulary, and so can use the DSL without learning computer programming in general.
Here are two examples from my research. A student of mine working for a software subcontractor created a process-control language (PCL) for managing a General Mills factory, producing one cereal for a few weeks, then automatically switching over to produce another for a few weeks as needed by demand, and so on. The verbs were like open_valve (32, 90) which signaled to valve #32 to open to 90%; set_temperature (5, 200) to set the temperature of mixer #5 to 200 degrees Fahrenheit; wash_mixer (5) which performed an automatic cleaning cycle for mixer #5; or shut_down (30) which shut the whole production line down in 30 seconds. The domain expert knew about stirring speeds, temperatures, and shutting down the factory, but not about computers.
Second, a student and I have been involved in tagging New Testament (NT) Greek passages with domain-specific linguistic tags like morphology, word, clause, and sentence level XML tags. For example, to mark a present active indicative 3rd person plural finite verb. They nest properly among themselves, but not with another set of XML tags marking book, chapter, verse, and part to refer say to 1 Peter 2:9a, since a sentence can cross a chapter boundary. And not with yet another set of XML tags marking embedded quotations, typically from the Old Testament, such as , which also may cross NT chapter boundaries.
Others are still catching up with the NT work that we did in 2001, as evidenced by this site as of 2005.
Oops … my embedded XML was all eaten by the WordPress parser, and I don’t think WordPress supports escaping the XML. Fortunately, the above post still reads grammatically and coherently without the example lines of XML.