Qi Qiao Ban: Component Scan

Posts mit dem Label Component Scan werden angezeigt. Alle Posts anzeigen

Freitag, 25. Juli 2014

150 Lines just for one Collection

Some 15 years ago I left the Java Enterprise Edition train and am now pushed by things like CapeDwarf on JEE Servers to the evaluation of running Tangram somehow in a JEE environment. I read the promise, that things have become a lot easier and that JEE has adopted the annotation and convention based style I now feel familiar.
I came across the JSR330 annotations from the javax.injection package which were introduced in the JEE world. JSR330 Annotations are recognized by several Dependency Injection frameworks like Guice, Springframework, my homegrown dinistiq, and obviously the JEE Containers.

Step 1: Get rid of Spring

This is were I started to make Tangram less depending on the Springframework directly or provide interfaces to be implemented in a spring and a non-spring way. For the dependency injection part it was mostly as easy as to migrate from

    @Autowired
    private ClassRepository classRepository;

to

    @Inject
    private ClassRepository classRepository;
I lost the required = false option at that time but was able to live with that given the advantages of portability.

Step 2: Find Alternatives

The spring parts of the application where re-implemented using a plain servlet based solution, and springsecurity was replaced by Apache Shiro for those scenarios.
After having migrated to the new annotations most of my Components where portable across containers. So I don't have to sell my soul to a special DI framework for very many cases, which helps choosing the right infrastructure for every project. But there are several details which still remain problematic, most notably the missing required = false option and the missing collection of instances to be able to automatically let the whole set of instances be injected.

Step 3: Annoying Details

The annoying collection problem looks like this:

@Autowired(required = false)
private Collection<ControllerHook> controllerHooks = new HashSet<ControllerHook>();

Later I migrated this to the JSR-330 Annotations - which also works with spring.

@Inject
private Collection<ControllerHook> controllerHooks = new HashSet<ControllerHook>();

I made this work with dinistiq as well.
With Guice I learned, that it didn't like to collect the instances for me and provide me with the collection out of the box.
I'm expecting here, that any instance implementing ControllerHook from the application context gets added to a collection which then is injected into the consuming component. Spring does it, dinistiq does it, Guice can be convinced to do it by the additional Multibinder:

Multibinder<ControllerHook> multibinder =
Multibinder.newSetBinder(binder(), ControllerHook.class);
multibinder.addBinding().to(UniqueUrlHook.class);
multibinder.addBinding().to(ProtectionHook.class);

Step 4: Show Stopper

Inspired by the CapeDwarf project, promising to be able to deploy applications developed for the Google App Engines and their APIs on a JEE infrastructure, I tried to deploy GAE based Application to such an Infrastructure based on a JBoss AS7 or Wildfly 8 Server using CapeDwarf Versions 1 or 2 respectively.
This approach doesn't work for Tangram since the used DI solution within the application - be it dinistiq or the Springframework - interferes with the JEE Container's CDI implementation. The Application Server starts to interpret the annotations intended for the other DI framework. This would instantiate the application components twice and right at the moment it also fails on some of those components.

Step 5: JEE Wordiness

So quite naturally I now went over to directly migrate Tangram to the JEE world. But also JEE has the same problem as plain Guice has.

@Inject
private Collection<ControllerHook> controllerHooks = new HashSet<ControllerHook>();
This time the solution really is ugly. It seems I have to implement a different consuming component and thus instead of the two lines, a separate ControllerHookProvider is needed

public interface ControllerHookProvider {

    Collection<ControllerHook> getControllerHooks();

} // ControllerHookProvider
Of course the generic implementation looks like the good old two lines which did the whole job for me so far.

public class GenericControllerHookProvider implements ControllerHookProvider {
    @Inject
    private Collection<ControllerHook> controllerHooks;

    public Collection<ControllerHook> getControllerHooks() {
        return controllerHooks;
    } // getControllerHooks()

} // GenericControllerHookProvider
For JEE an alternative implementation has to be chosen, which made the interface necessary in the first place, and it is surprisingly wordy for that common and simple scenario.

@Named("controllerHooksProvider")
@Singleton
public class JeeControllerHooksProvider implements ControllerHookProvider {

    private Collection<ControllerHook> controllerHooks = new HashSet<>();

    @Inject
    public void setControllerHooks(@Any Instance<ControllerHook> hooks) {
        for (ControllerHook hook : hooks) {
            controllerHooks.add(hook);
        } // for
    } // setControllerHooks()

    public Collection<ControllerHook> getControllerHooks() {
        return controllerHooks;
    } // getControllerHooks()

} // JeeControllerHooksProvider
This really didn't invite me to go deep into JEE again. Additionally it is missing any of the configuration file based bean definitions avoiding much of the interfacing of classes where in fact just two injected values make the difference. With the Springframework configuration files and auto scanning of beans go hand in hand. JEE intentionally left out this part.

Donnerstag, 24. Juli 2014

Fast on the first Request

It seems to be common sense with so many developers, that "expensive" things that only happen once are not that bad. And when those things happen at the wrong point in time that it is a good idea to move them to the "application startup".
So most of us are not very surprised and consider it no problem, if this startup takes some time.

Welcome to the Cloud

At a first sight, deploying my applications in the cloud is exactly the same as it used to be. I follow accepted standards, use common frameworks and respect many best practices. So any runtime platform supporting this should do the job. Cloud in that case means, that many more aspects of deployment and operations get automated - including that starting and stopping of additional instances, load balancing and so on.
But wait... When does the platform get the idea to start new instances?

Let the Customer wait?

It's just load. And load means, that Customers issued HTTP Requests and are awaiting responses in time. From research we know that in time means something around 3s. But when a user request leads to the start of a new instance to handle the load it's quite too late to do all the expensive stuff. Application startup is not the right point in time anymore.
Additionally you can learn, that all those nice precomputations simply re-calculate values the instances from yesterday or the other instance on the other machine already learned. And of course it still is a good idea to have all those values at hand - meaning to "cache" them. But very many of the values which are not coded or configured into the application deployment stay the same as long as the deployment environment doesn't change. Or they stay the same unless the application itself changes them and thus is able to tell when really to re-calculate derived values in the caches.

Examples for all this are:

Database derived value caches like query results with calculations based on the result, where just this application writes to the database. Those values might be needed at startup but stay constant until the application itself changes the database contents.
Classpath scanning to auto-discover software components. The developers and deployers wont want to collect the lists manually or at build time but the components don't change after the deployment. And definitely not on every startup.
Webtemplates and other codes which need to be fetched and prepared for use (e.g. get compiled). Those codes get prepared on every startup of the application while they change not that frequently and especially not on application startup. Obviously they add a big amount to the the response time since those codes need to be executed for the generation of the response.

Just use a Cache

What? Oh, well not that Cache. This Cache. This cache doesn't need to be as fast as the in memory caches already available and I didn't want to re-invent the wheel. They just need to have some values at hand some other instance already calculated and they have to be changed when these values change - which is - as already pointed out - not as frequent as other runtime values.
The values in the cache still are volatile and can be re-calculated by the application at any time. But they should be persisted for some time to be available at startup and reduce the time for the first request in web applications.
The jsr107 cache implementation in the Google App Engine is an example of exactly this approach. I wrapped this cache as a PersistentRestartCache in Tangram and cut reponse times for the first request to a third (or half as the worst case scenario) from around 30s to 11s. For all other platforms available for the Tangram dynamic webapplication framework I presented a simple (maybe too simple) file based implementation which does most of the job as well.
Still this leaves out the classpath scanning of the Springframework or dinistiq. In my environment the Springframework uses between 3s and 7s and dinistiq around 4s to do the basic application setup based on classpath scanning and additional configuration files. So half of the time the startup deals with the framework code and not the applicaiton startup itself. And this value only was achieved by nearly brutal reduction of the portion of the classpath to be scanned creating a "components" package where all autoscanned components reside for the Springframework and dinistiq.

Resume

Caches are a good Idea. Applications with a dynamic deployment are a good idea. Thinking of the application startup as a point in time where long calculations might take place is not that much of a good idea (at least anymore) where instances should be automatically braught up depending on load and not human decisions.
So simply bring together caches with persistence and knowledge when those caches can be invalidated. As an example I did this for the tangram web application framework and had quite some success on the Google App Engine, run@cloudbees, and OpenShift cloud platforms for the Java world.
The last but also important point: Optimizing the application startup in general is worth the time nowadays.

Donnerstag, 17. Juli 2014

Not that Groovy

Ist es eigentlich schon allgemein akzeptiert, daß Web Anwendungen nicht mehr davon ausgehen dürfen, daß es für sie eine entspannte Startphase gibt? Daß sie auch beim ersten Request für den Endbenutzer ausreichend schnell antworten sollten? Es klingt immer noch häufig so, als wären "teure" Dinge, die nur einmal in der Anwendung passieren kein so großes Problem und das man dieses "teure" einfach auf die Startphase verschiebt.
Alles was ich hier schreibe, ist nur für Projekte relevant, in denen ein Deployment nennenswerte Kosten verursacht oder es aus anderen Gründen keine gute Idee ist, ständig die gesamte Software installieren. Mir war diese Stabilität mit großen Systemen auf Basis CoreMedia immer ein so großer Vorteil, daß ich das auch mit Tangram im kleinen nachgezeichnet habe (bzw. dort sogar für die großen Projekte vorgezeichnet habe). Dennoch muß man auf wechselnde Anforderungen genauso dynamisch reagieren, wie auf die Anfragen der Kunden an das System.

Messungen mit Groovy-Code in der Datenbank

Um die Software einfacher und schneller an Anforderungen anzupassen, habe ich mich ja vor Jahren auf Groovy-Code in einem irgendwie gearteten Repository festgelegt. Wir kennen das von Stylesheets, JavaScripts, Templates (z.B. Freemarker, Apache Velocity).
Damals habe ich für einen großen deutschen Telekommunikationskonzern eine Reihe von Messungen durchgeführt, die deutlich belegt haben, daß der Zugriff auf eine fertig kompilierte Groovy-Klasse - oder gar eine vorher abgelegte Instanz davon - keine Nachteile gegenüber dem Java-Code des statischen Programmteiles hat.
Für den einzelnen Request einer Webanwendung ist es also vollkommen egal, ob der Code nun deployed wurde oder in der Datenbank liegt.
In der Folge hat die typische Tangram Webanwendung kaum noch Anteile an Java-Code (siehe z.B. amor oder dragon-map usw). Es wird alles über Templates mit Apache Velocity für die Darstellungsschicht und Groovy-Code für die Geschäftslogik oberhalb der Datespeicherung bis hin zur URL-Struktur erledigt. (Mit JDO kann sogar ein Teil des Datenmodells so in den dynamisch veränderbaren Teil verlagert werden.)
Die Effizienz bei der Ausführung wird durch einfaches Kompilieren und Instanziieren beim Speichern der Codes erreicht.
Und hier genau schlummert ein Problem, das sich mir nun richtig stellt, wo einiges an "dynamischen" Codes (also solchen in der Datenbank, die potenziell häufiger geändert werden) zusammengekommen ist:
Im Gegensatz zu den Aufrufen der Anwendung und dem direkten Benutzen der Instanzen erhalten wir beim Start des Systems einen deutlichen Performance-Nachteil, da hier nun einiges an Datenbankabfragen und Kompiliervorgängen sowie Instanziierungen zusammenkommt. In der Summe ist das deutlich aufwendiger als die entsprechenden Vorgänge mit statischem Java-Code.
Warum interessiert der Start aber nun gerade wieder? In den meisten Szenarien kommt so etwas doch nicht gerade häufig vor.

Willkommen in der Wolke

Für einen selbstbetreuten "on premise" Servlet/JSP Container mag das stimmen, aber gerade für die "billigen" Modelle in der Wolke und z.B. auch die Google App Engine, OpenShift oder run@cloudbees stimmt das überhaupt nicht, da hier zum Skalieren und bei Nichtbenutzung der Trick gerade darin besteht, Instanzen der Anwendungen herunterzufahren oder neue hinzuzustellen.
Bei Instanzen auf der Google App Engine, bei denen keine sonstigen Zusicherungen eingestellt sind, führt das zu einem netten Hinauf- und Herunterfahren wie ein Jojo. Bis Version 0.7 beinhalten alle Tangram Anwendungen einen Cron-Job, der genau das verhindern sollte, indem die Anwendung sich selbst aufgerufen hat. Vor einiger Zeit habe ich mich dem Problem nun endlich gestellt und versucht, die Startup-Sequenz einmal zu tunen. In der Ausgangslage sind wir bei einem Erst-Start im Bereich von gerne 30s abgelaufene Zeit.

Spring Tuning

Das ist nicht das erste Mal: Bereits früher hat das Springframework gezeigt, daß es den Komfort für den Entwickler mit "Kosten" beim Start belegt. Der Scan der Klassen nach Annotationen und Komponenten sowie deren vollautomatisches Zusammensetzen dauert auf einem mittelprächtigen Rechner zwei bis drei Sekunden, in der App-Engine jedoch gerne bis zu acht Sekunden, die ein Kunde am Browser warten müßte. Das konnte damals durch das Eingrenzen beim Scan reduziert werden:

<context:component-scan base-package="org.tangram" />

Die Angabe des Base-Package ist hier wichtig und grenzt den Bereich der Klassen, die geprüft werden, nachhaltig ein, was einige Sekunden bringt. In Zukünftigen Versionen sollten hier noch weitere Einschränkungen greifen, weil das "base package" org.tangram immer größer wird und über eine Anzahl von Jars verteilt ist. Ab Version 0.8 müssen daher all Komponenten, die gerne gescannt werden möchten "org.tangram.components" mit Vornamen - also Package-Namen - heißen.

<context:component-scan base-package="org.tangram.components" />

Dazu wurden einige Klassen in den Paketen verschoben und es bringt wieder einmal ein paar Bruchteile von Sekunden ein. Lohnend, aber es durchbricht die fachlich Sortierung der Klassen ein wenig. (Aus org.tangram.jdo wird org.tangram.components.jdo - man findet sich also schon zurecht)

Caching

Moment! Wieso Caching? Das Füllen der Caches - soweit zu dem Zeitpunkt hilfreich oder notwendig - ist doch gerade das, was den Start u.a. so langsam macht. Aber die Inhalte der Caches werden dann bei Start genau dieselben sein, wie beim letzten Herunterfahren - oder dieselben, die schon eine andere Instanz berechnet hat. Es wäre also ganz nett, wenn man einfach auf diese Wert zurückgreifen könnte.
Das tut Tangram letztlich auch bereits seit langer Zeit in der Google App Engine: Der Scan nach Klassen, die zur Persistenz-Schicht gehören, fällt in die gleiche Rubrik wie der Scan des Spring-Framework und die Daten, die sich die BeanFactory hier merken muß, werden in der Google App Engine im Memory Cache über die JSR107 Schnittstelle gespeichert.
Diese Schnittstelle soll uns nun helfen, noch mehr Vorgänge nur so häufig auszuführen, wie es notwendig ist, unabhängig davon, ob es wir von einem Start-Request sprechen oder mitten in der Arbeit sind. Das macht die Webanwendungen Wolken-tauglicher als sie es bisher sind.
Die Benutzung dieses Caches ist denkbar einfach:

try {
CacheFactory cacheFactory = CacheManager.getInstance().getCacheFactory();
jsrCache = cacheFactory.createCache(Collections.emptyMap());
} catch (CacheException ce) {
log.error("()", ce);
} // try
Und danach ist es mit put() und get() getan, wobei man natürlich immer gewahr sein muß, daß einmal nichts im Cache ist. - Und einige kompliziertere Sache - obwohl java.lang.Serializable markiert - kann man zwar hineinstopfen, findet sie in der Admin-Oberfläche, bekommt sie aber nicht mehr heraus.
Das gilt leider insbesondere für per JDO persistierbare Objekte und kompilierte Java Klassen.

Tunen wo es wehtut

Langsam ist bei einem leeren Cache das beziehen aller Codes, die für die Website notwendig sind, aus dem Datastore. Hier geht es schon bei wenigen Codes eher um Sekunden als die oben zitierten Bruchteile. Also enthält Tangram ab Version 0.8 einen Simplen, persistenten Query-Cache, der genau wie der transiente Cache die IDs der Objekte der Ergebnislisten den Queries erfaßt und ablegt. Dieser Cache bringt der gesamten Anwendung bei ersten Aufrufen einer Instanz eine deutlichen Schub bei abfragebasierten Anwendungen.
Mit diesen ID-Listen muß man sich leider immer noch an den Datastore wenden, um die Objekte zu beziehen, da diese Objekte ja nicht in den Cache gelegt werden konnten.
Diesen Schritt aber wiederum kann man sich bei den Codes gut sparen, in dem man sie beim Lesen in reine transiente Objekte umwandelt (hey, es sind Codes!) und sie dann im persistenten Startup Cache ablegegen zu können.
Jetzt ist eine der längsten Phasen beim Starten das Kompilieren der Groovy-Klassen. Hier den Binärcode zu cachen hat sich bisher als nicht umsetzbarer Ansatz erwiesen. Aber evtl. kommt das ja auch noch. So ist der limitierende Faktor beim Startup auf die Anzahl der Groovy-Codes und den Compiler gesetzt. Mehr ging bisher nicht, aber es hat den Startup in der Zeit gedrittelt. - Und die Ansätze daraus sind verallgemeinerbar für Anwendungen in der Wolke.