The Repository Structure
I wanted to get this discussion started so we can begin to explore the pros and cons of various means of storing different sorts of modules in the repository. This starting point will be rather long, so I am going to number the sections according to the topic discussed.
1. Naming of Module Categories
First, there has been some discussion of physically separating modules based on their classification. The three initial classifications that have been proposed are "standard," "stable" and "experimental." As an initial matter, I do not think these are the best names for these categories. In particular, the term "experimental" perhaps implies that the code is not reliable and is not suitable to be used in real world projects. Yet, this is the area where we will be typically developing code for our end users who will then make use of that code in real world projects. Also, it should be kept in mind that code in the "experimental" section may in some cases be equal or even superior to code in other classifications. I think that we should use a more neutral term that does not have any tendency to disparage the value of code developed in that category.
So, here is my proposal. I think we should have three or four categories. Instead of "standard" we should have "core" or alternatively we should have both "core" and "standard." For now, since we do not have anything approximating the idea of a "standard extension" I think we should just have "core" and explore adding a "standard" category later, if and when we decide that it is useful to designate modules as such.
Instead of "stable" I think we should call it "community." Code here is not necessarily more stable than code in the "experimental" area, which may in fact be quite mature and have a limited set of real world users dependent on it. I think the main idea is not just that the code in this area is stable; the idea is that the maintainers of code in this area are able and willing to back that code up and provide wider support for it. Clearly, in order to provide this sort of support, it would be necessary for a developer or development team to have a stable tag or branch. However, that same development group might have other branches and tags that are in development, and are unstable. It seems strange to me to have any code whatsoever in the "stable" area that is not in fact stable. But that is precisely what you will have, as actual development is likely to occur here as well, perhaps in various "experimental" branches. I think that overall, we already have directory structures that are intended to indicate the stability of code. Code in the "trunk" or in a "branch" can be expected to change. Code in a "tag" should be stable and probably should never change. I do not think we need any more indicators than this, although we could conceivably want to add the concept of a "stable branch" to differentiate branches where active development occurs from branches where changes are made more conservatively. Regardless, the indication of stability should be in the trunk, tags, branches area, not at a higher level of the tree where it is sure to be misleading. The term "community" in contrast is not misleading. This is code that is intended to be shared. There should be some stable code here, preferably in a tag. But, there might also be some unstable code that you should not be using without close collaboration with the developer.
Coming up with an alternative name to "experimental" is more challenging, but I would propose as an initial matter the term "project-specific." Here, we have modules that are developed with a particular individual or team in mind, and which are not ready (and may never be ready) to be shared with the wider community. There might be very stable code here, and there also might be more experimental code here.
Overall, I think using the terms "stable" and "experimental" would be both misleading (because there would be stable code in the experimental area, and experimental code in the stable area) and redundant (we already have the concept of trunk, tags, and branches to indicate stability). I think it would be preferable to use the abstract terms "community" and "project-specific" because they more accurately convey what is intended to be stored in each category and also because the term "experimental" is simply too disparaging for code that will in fact be depended on by specific individuals and teams for their important, reliable, and serious scientific investigations.
2. Migrating Modules
There is an issue regarding migration of modules. It may be desirable from time-to-time to change the classification of a module. A "project-specific" module may have much functionality that a developer thinks could be widely applicable, and they want to share it with the wider community. They are also able and willing to commit resources to supporting the community in the use of that module. As such, they would like to change the classification of the module from "project-specific" to "community." Alternatively, one can imagine a case of a module that was classified as "community" but which simply did not take off. The maintainer of that module may no longer have the resources to provide support for the wider community. They thus may wish to reclassify the module from "community" to "project-specific."
What are some of the technical challenges involved in changing the classification of a module? Well, because the configuration of which modules are meant to be used together is distributed in order to allow for decentralized development, moving an entire module from one category to another is not a good idea. When a module is reclassified, each and every suite or module group that uses that module will have to be updated to reflect that change. Given that a module group could even exist outside of our own repository, such changes have the potential to be very disruptive. Since not all suites or module groups are necessarily stored in our repository, there is no way for an individual who decides to reclassify a module by physically moving it to know exactly who will be impacted.
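To make the breakage concrete, here is a minimal Python sketch. It is not the build system's actual configuration format (which is not shown in this post); the suite names are borrowed from later in this discussion, and the category folders and paths are hypothetical.

```python
# Hypothetical suite definitions: each suite lists the repository paths of the
# modules it uses. The category folders and paths are made up for illustration.
suites = {
    "ppod-suite": ["experimental/ppod-actors/trunk", "experimental/comad/tags/1.2"],
    "comad-suite": ["experimental/comad/trunk"],
    "other-suite": ["stable/util/trunk"],
}


def broken_suites(suites, old_prefix):
    """Return the names of suites that reference a path under old_prefix."""
    return sorted(
        name
        for name, paths in suites.items()
        if any(p == old_prefix or p.startswith(old_prefix + "/") for p in paths)
    )


# Reclassifying comad by moving experimental/comad to stable/comad breaks
# every suite that still points at the old location.
print(broken_suites(suites, "experimental/comad"))  # prints ['comad-suite', 'ppod-suite']
```

Note that this check only works for suites we can see. A suite kept outside our repository never appears in the dictionary, which is exactly the problem.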
There are various hacks that we might use to get around this. Instead of modules referring to specific locations, we could have them refer to a centralized module registry which tracks the location. This would be a huge headache and would add unnecessary bugs and complication to the system. It would compromise the more decentralized structure that we have now, as any new modules would be required to register themselves in our centralized module registry. Overall, in my view, this or any other solution that relies on some sort of centralized registry is a bad idea.
I think we need to ask ourselves, exactly what are we buying by having "community" and "project-specific" (or "stable" and "experimental") modules stored in different areas in the repository? My answer is that we are not buying ourselves much of anything. This is nothing more than a particularly expensive means of documentation. Only, when you decide to update the documentation (i.e. reclassify a module), it actually has the potential to physically break the entire system and cause no end of headaches. A simple alternative is to simply have a listing of non-core modules that we consider to be "community" modules. Or better yet, I propose that we do away with the distinction altogether! The line between a project-specific module and a community module is rather fuzzy. Probably most people who develop modules would not mind contributing to the community, but they also probably have limited and varying resources regarding the extent to which they are able and willing to support that module's use by others. What we have in essence is a continuous variable (i.e. intention, willingness, and ability to provide support for a module) classified in a stark and simplistic binary scheme (either "community" versus "project-specific" or "stable" versus "experimental") that distorts as much as it enlightens. Not only that, we are to develop either a complex centralized registry or break people's code in order to provide this sort of documentation? I would classify this firmly in the bad idea category. (I don't mean to be binary here. I am not saying that there are only bad ideas and good ideas, or that all bad ideas are equally bad!) I don't think we should necessarily have this classification of modules at all. But, if we do, may I suggest that we keep it to pen and paper rather than in the repository structure???
3. The Use of Multiple Strategies in the Structuring of Modules
It has been proposed that different modules be structured according to multiple standards in the repository. In particular, it has been proposed that all modules of a certain category be stored in only three folders. As a simple example, consider the case where you only have 4 modules, A, B, C and D. The proposal is to store code associated with those modules in the following manner:
trunk/moduleA
trunk/moduleB
trunk/moduleC
trunk/moduleD
tags/moduleA-tag1
tags/foo-moduleA-tag2
tags/moduleA-tag3
tags/moduleA-tag4
tags/moduleB-tag1
tags/beta-moduleB-tag2
tags/moduleB-tag3
tags/moduleB-tag4
tags/moduleC-tag1
tags/alpha-moduleC-tag2
tags/moduleC-tag3
tags/moduleC-tag4
tags/moduleD-tag1
tags/bar-moduleD-tag2
tags/moduleD-tag3
tags/moduleD-tag4
branches/moduleA-branch1
branches/moduleA-branch2
branches/moduleA-branch3
branches/baz-moduleA-branch4
branches/moduleB-branch1
branches/moduleB-branch2
branches/foo-moduleB-branch3
branches/moduleB-branch4
branches/moduleC-branch1
branches/moduleC-branch2
and so on...
Now, you may notice in this simplified example, there are only 4 modules, A, B, C, and D. Yet, these three directories are already getting rather large. And even a bit confusing. It sort of reminds me of the somewhat confusing (to me at least, and I know I am not the only one) structure that already exists in tags and branches under kepler at the following locations:
https://code.kepler-project.org/code/kepler/branches
https://code.kepler-project.org/code/kepler/tags
Go ahead, have a look... I will be waiting for you when you get back.
Are you confused yet? Let's see, let's look at a couple of the branches:
KEPLER_1_0_0_BRANCH/
RELEASE-BRANCH-1-0-0/
Which branch do you think is associated with Kepler 1.0 that was released? Well, I suppose after thinking about it, you say, well, it was a release, so, I bet that RELEASE-BRANCH-1-0-0 is the one! Besides, I much prefer hyphens rather than underscores to separate numbers and words, so it must be that one! And you would be right. Okay, you may have guessed that one (I had a much harder time than you). But how about this.
Consider the following tags:
RELEASE-1-0-0-beforeJarRemoval/
RELEASE-BRANCH-1-0-0-beforeJarRemoval/
RELEASE-TAG-1-0-0/
KEPLER_1_0_0_BRANCH_MERGE1/
POST_MERGE1_KEPLER_1_0_0_BRANCH/
v1_0_129/
v1_0_170/
v1_0_174/
Now, if you are anything like me, you are thinking: what the heck are all these tags named BRANCH? Is it a tag, or is it a branch? Well, which of these tags do you think is associated with the 1.0 release of Kepler? You probably guessed RELEASE-TAG-1-0-0. And you would be right! However, since I am not as smart as you, I will admit that I wasn't able to figure it out before I asked around. Oh, one other really interesting thing. The tag RELEASE-TAG-1-0-0 is exactly the same as the branch RELEASE-BRANCH-1-0-0. Why is the exact same thing stored as both a tag and a branch? It is a mystery to me too. Why in the tag does release come first, but in the branch, it comes second? Another mystery. Oh, and what the heck do v1_0_129 and v1_0_170 and all the rest of the 100 or so similar tags refer to? I think I knew once, but I don't remember. I do know they were supposed to be deleted, but never were. I consider them eye candy! It is nice to live in a world with so many mysteries.
Anyway, why spend all this time reviewing the old repository structure? We are supposed to be forward thinking here. It's off to a new (and hopefully better) repository structure in the future! Well, not so fast. This mess in the old repository is the result of branching and tagging only one piece of software. Namely, Kepler. But, when we break up Kepler into little parts called modules, and each of those modules has its own branches, tags, and trunks, when we store all those branches, tags, and trunks in the same three folders, this old repository structure will seem like a pleasant fantasy of wondrous simplicity! This is, after all, nothing more than the branches and tags from one code base, not several as has been proposed.
Well, I have a few more criticisms for this proposal. First and foremost, we really do not need to have two different ways of storing modules. This is just going to cause unnecessary confusion for new developers. Why are some modules stored this way, and others stored that way, they will ask. Second, this deviates from the SVN standard. You are supposed to have the name of the software, then trunk/ then straight to the code. You are not supposed to have anything between trunk and the code. It is not supposed to be kepler/trunk/foo-module/<code>; it's supposed to be kepler/trunk/<code>. Moving away from the standard will add unnecessary confusion for new developers who are already familiar with the standard.
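To give a rough idea of the bookkeeping a shared tags folder pushes onto people and tools, here is a small Python sketch using names from the four-module example above. The naming pattern is an assumption of mine; nothing in a flat layout enforces it.

```python
import re

# Tag names taken from the simplified four-module example.
flat_tags = [
    "moduleA-tag1", "foo-moduleA-tag2", "moduleB-tag1",
    "beta-moduleB-tag2", "alpha-moduleC-tag2", "bar-moduleD-tag2",
]

# Assumption: module names look like "moduleX". A flat layout does not
# guarantee this, so any tool relying on the pattern has to keep it current.
MODULE_PATTERN = re.compile(r"module[A-Z]")


def group_by_module(names):
    """Group flat tag names by the module name embedded in each one."""
    groups = {}
    for name in names:
        match = MODULE_PATTERN.search(name)
        key = match.group(0) if match else "unknown"
        groups.setdefault(key, []).append(name)
    return groups


for module, tags in group_by_module(flat_tags).items():
    print(module, tags)
```

A name such as v1_0_129 matches no module at all and lands under "unknown". With one tags folder per module, the owning module is simply a path component and no parsing is needed.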
What about when I want to rename a module? That use case is going to be a huge mess too.
What is all this buying us exactly? Oh, right. It is allowing us to check out the trunk of all the core modules with one svn command. That way, we don't have to depend on the build to do it.
I have a couple of issues with that.
First, the build is simply superior at doing checkout than a single SVN command from the command-line. Imagine this scenario. Your internet connection fails during your checkout of all the core modules. With the svn command-line, you have to start all over (I believe). With the build system, one or more modules have probably been downloaded and added to the build system module registry. The build system is smart enough not to try to retrieve this code again.
Second, like it or not, we need to make a commitment here! We don't trust the build system or any other system to do checkout, but it's okay for the vast majority of non-core modules to depend on that "unreliable" (actually, more reliable) system? I don't think so! We need to eat our own dog food, as it were, and use the same build system for the core that we promote for other non-core developers.
Third, you check out the core using SVN instead of the build system. Then what? You are going to be working with a bunch of rogue modules that aren't registered in the build system module registry. Not a good idea.
Fourth, while the build system and other systems do not currently explicitly depend on the repository structure, it would be nice if, in the future, we keep open at least the option of depending on the repository structure. When you decide to go with a multiplicity of standards for storing modules in the same system, it adds a whole level of new complexity to any code that would like to make use of the standard used to store modules in the repository. Why don't we commit to a single way of storing modules?
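The first point above, that a checkout driven by a module registry can pick up where it left off, can be sketched in Python as follows. This is not the actual build system code; the registry, the fetch callback, and the module list are hypothetical stand-ins.

```python
def checkout_all(modules, registry, fetch):
    """Fetch each module not yet in the registry; record successes as we go."""
    for name in modules:
        if name in registry:
            continue  # already downloaded on an earlier attempt
        fetch(name)         # may raise if the network fails
        registry.add(name)  # only reached when the fetch succeeded


# Simulate a connection that drops once, while fetching the third module.
calls = []
state = {"failed_once": False}


def flaky_fetch(name):
    calls.append(name)
    if name == "gui" and not state["failed_once"]:
        state["failed_once"] = True
        raise ConnectionError("network dropped")


core = ["loader", "util", "gui", "actors"]  # placeholder module list
registry = set()

try:
    checkout_all(core, registry, flaky_fetch)
except ConnectionError:
    pass  # the first attempt stops at "gui"

checkout_all(core, registry, flaky_fetch)  # the retry skips loader and util
print(calls)  # prints ['loader', 'util', 'gui', 'gui', 'actors']
```

The only thing that matters here is that success is recorded per module, so a retry repeats just the module that failed and the ones after it.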
Overall, I am again going to have to classify this idea that we should have multiple standards for storing modules as a very bad idea. What are we buying here? SVN checkouts of the core with a single command? Write a darn script for goodness sake if you want that for your own use. Don't corrupt the repository and add unnecessary complexity to our development efforts which we will have to live with for a very long time because you don't trust the build system. If we don't trust the build system for the core, perhaps we shouldn't use it at all! We are going to promote this build system to the community, but it is not good enough for us? I don't buy it, especially since the build system is actually a better way to do checkouts anyway, because it is more robust in cases of network failure. Regardless, we shouldn't be adding huge amounts of unnecessary complexity, both from a mental perspective (new developers saying, what is going on with multiple ways of storing modules in the repository and what the heck is going on with all of these branches in the same directory and why aren't we following the SVN standard) and technically (we have to deal with the complexity of storing modules in two different ways in the repository and renaming a module - when for example, you decide to break one module into two - is a nightmare).
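For anyone who does want the one-step checkout for their own use, such a script can be very short. Here is a sketch in Python that assumes the current one-folder-per-module layout under the existing modules URL. The module list is a placeholder, not an agreed definition of the core, and the script only prints the commands rather than running them.

```python
# Base URL of the existing modules area in the repository.
BASE_URL = "https://code.kepler-project.org/code/kepler/modules"

# Placeholder list: which modules count as "core" has not been decided here.
CORE_MODULES = ["loader", "comad"]


def checkout_commands(modules, base_url=BASE_URL):
    """Build one 'svn checkout URL PATH' argument list per module trunk."""
    return [
        ["svn", "checkout", f"{base_url}/{name}/trunk", name]
        for name in modules
    ]


for command in checkout_commands(CORE_MODULES):
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Nothing about the repository has to change for this to work, which is the point: the convenience lives in a script, not in the folder structure.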
4. My Proposal for a Repository Structure
This is my proposal:
modules/
    core/
        core-module1/
            trunk/
            tags/
            branches/
        core-module2/
            trunk/
            tags/
            branches/
        etc...
    ppod/
        ppod-suite/
            trunk/
            tags/
            branches/
        ppod-actors/
            trunk/
            tags/
            branches/
        prov-gui/
    comad/
        comad-suite/
            trunk/
            tags/
            branches/
        comad/
First, let me describe some basics about this proposal. We have a core area, so that core modules can be stored together. For each suite (previously known as a master module), we have a folder. For example, ppod/. In that folder there is a folder called ppod-suite/ which contains the meta-data specifying which modules are meant to work together. Finally, there are additional folders containing all the modules that are unique to that suite. Just by looking at the repository, you can get an idea of who is responsible for developing a module. You know that ppod-actors is maintained by whatever people are in charge of ppod. Whereas the comad module, which ppod also depends on, is probably maintained by the people in charge of developing comad-suite. In this particular case, those people happen to be one and the same. But they might not be.
Let me describe the characteristics of this proposal that address the objections I raised to alternatives above. Here, there is no distinction between "stable" and "experimental" modules or between "community" and "project-specific" modules. Instead, modules might consist of stable and experimental parts. There is no added complication arising from having to change the classification of a module according to some artificial scheme. Any such artificial scheme that exists is stored in documentation elsewhere (including perhaps in a standard location in the modules themselves, so anyone curious about the classification of a particular module could click on a standard file and see). There is no multiplicity of conventions utilized for storing a module. Once a developer understands how one module is stored, that developer understands how all modules are stored. It follows the SVN standard, so developers familiar with that standard are not confused. Tools are also able to, when and if necessary, easily rely on one standard convention for how modules are stored. You do not mix the branches of different modules into the same folder, which is nothing more than a recipe for chaos and confusion. Instead, this is a very simple and straightforward structure that allows modules to be grouped so as to manage the complexity of the modules directory, but does not sacrifice simplicity and understandability in doing so.
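As an illustration of what a single convention buys a tool, here is a short Python sketch that builds the repository path for any module under the proposed layout. The function name and its arguments are hypothetical; nothing like it exists in the build system today.

```python
def module_path(group, module, kind, name=None):
    """Return the repository path for a module's trunk, a tag, or a branch."""
    if kind == "trunk":
        if name is not None:
            raise ValueError("trunk takes no name")
        return f"modules/{group}/{module}/trunk"
    if kind in ("tags", "branches"):
        if not name:
            raise ValueError(f"{kind} requires a name")
        return f"modules/{group}/{module}/{kind}/{name}"
    raise ValueError(f"unknown kind: {kind}")


print(module_path("ppod", "ppod-actors", "trunk"))   # modules/ppod/ppod-actors/trunk
print(module_path("comad", "comad", "tags", "1.2"))  # modules/comad/comad/tags/1.2
```

One rule covers core modules, suites, and everything else. With two storage conventions, the same function would first need to know which convention a given module follows.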
I look forward to hearing the thoughts and opinions of others!
Interesting. One thing we're running into here is the limitation intrinsic in trying to represent information in nested folder structure when the information doesn't cleanly nest. As you point out, the experimental/stable/standard classification is in some cases nearly orthogonal to the project-specific/community/core classification. It is difficult to represent both using a single set of nested folders. However, I don't think that means we should reject representing one classification or the other this way. The question may be, who has more use for each classification scheme? Or perhaps one can be nested in the other? Maybe only the Kepler base system developers need the first classification? The term "Experimental" makes sense from the point of view of deciding what should be included in the next release of the Kepler base system, and what needs more work before distributing it as a standard component. At the same time, I agree that many modules will be very stable, useful, and well-supported even if they are never "standard" and never shipped as part of the Kepler base system. So classifying such modules as "Experimental" indeed would be misleading.
I think I proposed the idea of organizing branches and tags for different kinds of modules differently, and I agree with the limitations and disadvantages you point out.
---build.xml
----build-common-targets.xml
----build.properties
trunk/modules/core
tags/modules/core-release-1.0.0
tags/modules/core-release-1.0.1
tags/modules/core-checkpoint-1
tags/modules/core-checkpoint-chadexp
branches/modules/core-1.0.x
branches/modules/core-1.1.x
branches/modules/core-semtools
I want to pipe in here as well. I had a hard time digesting David's initial post in this thread -- it was an incredibly long rationalization, and therefore I didn't really even know how to start. I have no desire to write my own novella here, so instead I'll just make two points:
1) Having trunk/branches/tags at the root of the repository allows us to check out the head with no further tool support, and yet also allows the build system to pick specific modules or branches/tags of modules based on configuration information. This is the approach advocated in the SVN book, and I think it will be familiar to most SVN users. I see no downside to this approach, and several simplicity advantages. Having a large collection of branches and tags in one directory is not an issue, as a good naming system will work as well or better than additional directory hierarchy without the annoyances that the additional directories introduce.
2) The current problems in understanding the repository structure are due to lack of information about project history, not due to the inherent organization. I can make arguments similar to the ones David made about how confusing the current modules list is because of poor naming choices (e.g., what is the difference between vanilla-1.0, and vanilla-trunk, and why does vanilla-1.0/trunk exist, and what is an ustan-master module, ad infinitum). The problem is that we don't have metadata about what these modules are in the source code system. This is the same as the previous tags in CVS. I have no problem with the hundreds of v_### tags simply because I know they were generated by Cruise Control as part of our previous nightly build system. People with a shorter time horizon on Kepler might have a harder time understanding these artifacts. I think the solution in the new system is to have a consistent module naming system that is a) predictable enough to classify the various types of tags and branches, and b) associated with some more descriptive metadata that explains the content and purpose of a particular module or branch or tag. For this reason, I think the branch and tag naming scheme Chad proposed is definitely a major improvement over our current system or the current modules proposal.
Looking forward to movement on these items,
Matt
I have a proposal that I hope will make all this a bit simpler.
First, I'd like to point out that I think we all may be talking past each other because we have different ideas about what a 'module' represents. To me, a module is a unit of compilation. Everything in a particular version of a module is meant to be compatible with everything else in that module. However, the latest (trunk) version of one module generally cannot be assumed to be compatible with the trunk of some other module. The trunk of the ppod-actors module might be based on the 1.2 tag of the comad module (rather than the trunk of comad), for example, and for very good reasons. Modules of this sort are thus very loosely coupled, and it is not necessarily that useful to download all the trunks of every module (in this sense of the word), because they generally won't be guaranteed to work with each other, and many of their dependencies will not be satisfied.
The above is the concept of modules that the new build system is based upon. And I think we very much need this kind of module and the support for developing such modules that the new build system provides. It is critical for the projects I represent.
An alternative concept of 'modules' is that they represent conceptual chunks of software that are developed in concert such that the latest version (trunk) of each module is guaranteed to work with (or depend on) the trunk of all other modules. Checking out all of the trunks at one time in this case can be very useful (as can tagging all of these modules with the same tag). This, I think, is an appropriate way of managing the core of Kepler in the repository, the part of Kepler that is shipped in every standard release. The core is also the base system that developers adding new capabilities either to the core, or via a non-core module, would start with.
Fortunately, I think we can easily accommodate both kinds of modules in our repository and build system. Let's say we use the word 'module' (and a directory named 'modules' in the repository) to stand for the first concept: a unit of compilation and internal consistency. We then put the entire core of Kepler, or base system, inside just *one* of those modules. Inside that module we are of course free to be as modular as we please (I'd prefer using the Java package scheme rather than multiple source trees for this kind of modularity, but I could be persuaded). Checking out the trunk of the 'core' module would thus grab all of the code that comprises the Kepler base system. This would take only a single svn command. Checking out a tagged version of the core module would similarly check out all of the code for that version of the Kepler base system.
Does this distinction between these two concepts of modules make sense? If so, does the current approach of putting all modules (in the first sense) at the same level of the repository (where they are now), with their own trunk, tags, and branches directory in each, also make sense?
The core module would have a single trunk, and this would be all that is needed when building the Kepler base system.
Now, what if another module at some point needs to be added to the base system? Simple--we just do an svn copy of the appropriate version of this module's source code and resources into the right places in the core module. I'm hoping that the comad module can be made a part of the Kepler base system someday, and why shouldn't its source files be merged into the source tree for the base system at that point? The comad classes are all in the org.kepler package hierarchy anyway. Once comad becomes a part of the base system it *will* need to be kept in sync with the rest of the base system, so it really does belong with the files in the core module (again, in the first sense of a 'module').
I hope this doesn't count as another novella. If there is any confusion at all about what I'm saying, let's have a telecon soon to talk this through.
Here's another idea for organizing the repository that may meet all of our needs and desires, and that takes advantage of the svn:externals feature that, some time ago, Matt suggested might help us integrate our repository with Ptolemy's. Matt and I also chatted very briefly about the idea of exploiting svn:externals within our own repository a couple days ago. Here's the proposal:
1. We place all the modules in their own directories as is currently done at https://code.kepler-project.org/code/kepler/modules/. Each module has its own trunk, tags, and branches subdirectories. This allows each module to be branched and tagged independently as needed by experimental and non-standard extensions to Kepler. The build system is its own 'module' with tags and branches like the rest. So are 'core', 'util', and other modules that comprise the Kepler base system.
2. We create an additional directory in the repository that represents a module 'suite' comprising the modules of the 'Kepler base system.' Within this directory we use svn:externals properties to point to the other modules in the repository that make up the Kepler base system, including a version of the build system. Checking out this virtual module thus grabs a copy of all the modules needed to build and run the Kepler base system as well as the version of the build system that goes with it. One svn checkout command, ant run, and Kepler is running for the first time on a developer's machine. All source code directories are version-controlled and can be updated; changes can be committed without any fuss.
3. On the local machine, checking out the base system is analogous to checking out the build system module now. You can check out additional modules (either via svn or the 'get' target in the build system) into the top-level directory created in that first step. The build system is configured as now, and one can switch quickly between configurations to be built as can be done now. The difference is that the Kepler base system modules are checked out from the get-go, along with the build system. One can also switch the version of a particular base system module on the local machine if needed.
4. Adding a module to the base system is easy. Just add another entry to the svn:externals property on the Kepler base system directory. We can also create alternative configurations of the base system by creating different module suites defined using the same mechanism.
Thoughts?
David played around with the idea of using svn:externals described in the previous post. He created an externals.test module (https://code.kepler-project.org/code/kepler/modules/externals.test/trunk) and applied the following svn:externals property to it:
loader https://code.kepler-project.org/code/kepler/modules/loader/trunk/
comad https://code.kepler-project.org/code/kepler/modules/comad/trunk/
If you check out this directory from the repository, the loader and comad modules are included within it. If you do an svn status or svn update in the local copy of this directory (above the loader and comad sub-directories), these commands descend into the loader and comad directories automatically.
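For anyone who wants to generate such a property value rather than type it by hand, here is a small Python sketch that formats entries in the same "local directory, then URL" form used in the test above. The helper name is hypothetical; the two entries are the ones David used.

```python
def externals_property(entries):
    """Format (local_dir, url) pairs as an svn:externals property value."""
    return "\n".join(f"{local_dir} {url}" for local_dir, url in entries)


BASE = "https://code.kepler-project.org/code/kepler/modules"

value = externals_property([
    ("loader", f"{BASE}/loader/trunk/"),
    ("comad", f"{BASE}/comad/trunk/"),
])
print(value)
```

The resulting text could be saved to a file and applied to the suite directory with svn propset svn:externals -F <file> . followed by a commit. Adding a module to the suite is then one more pair in the list, and pointing an entry at a tag instead of the trunk is just a different URL.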