This article is part two in a series describing the evolution of my knowledge of and attitude toward creating Drupal custom entities. In part 1 of this series, I explained what sparked my interest in Drupal custom entities, my first, rather extreme, use of them where nearly all content on the site was a custom entity, and my conclusion from that experience. In this article, I will describe the other approaches I tried and my resulting final position on when and how I'll use custom entities in the future.
Phase 2: Custom entity avoidance
A few months after finishing the project described in part 1, I began work on another site that seemed like a potentially good candidate for custom entities. The site stored many calculation results that would be mostly "under the hood" information. By that I mean each individual calculation wasn't something you'd think of as a Drupal node with its own view and edit page, for example. In fact, this information would be calculated once and never subsequently edited, and during the analysis phase of the project it was hard to imagine a use case for viewing a calculation result as a standalone page either. So, this would seem an ideal candidate for being implemented with a custom entity.
However, at this time I still had a pretty bad taste in my mouth from the experience of maintaining the custom entities on that prior project. Because the project itself was planned to change significantly over time, there was no doubt that would include many changes to these entities. I considered the maintenance cost of custom entities to be substantial for this project. (See part 1 for details on what I'm talking about.) What's more, by this time I had a much better understanding of the Drupal cache system and how well it worked.
I decided to just use regular Drupal content types and nodes to store the calculations and see how it went. The calculations the site did were so complex and numerous that I built the recalculate process to work in "chunks" using the Drupal batch system so it would not time out. Because of that, the performance overhead of creating many hundreds of nodes was not problematic from a technical standpoint. From the user's perspective, the visual cues of the batch system, like the progress bar, also helped mask that impact.
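To make the chunking concrete, here is a minimal sketch of how that kind of recalculation can be wired into the Batch API. The my_module_* function names, the calculation_result content type, and the field name are hypothetical placeholders, not the project's actual code:

use Drupal\node\Entity\Node;

/**
 * Builds and starts the recalculation batch.
 */
function my_module_start_recalculation(array $all_results) {
  $batch = [
    'title' => t('Recalculating results'),
    'operations' => [],
  ];
  // Split the work into chunks so no single request times out.
  foreach (array_chunk($all_results, 50) as $chunk) {
    $batch['operations'][] = ['my_module_recalculate_chunk', [$chunk]];
  }
  batch_set($batch);
}

/**
 * Batch operation: store one chunk of calculation results as nodes.
 */
function my_module_recalculate_chunk(array $chunk, array &$context) {
  foreach ($chunk as $item) {
    // The real calculation logic is project-specific; here each $item is
    // assumed to already hold the computed label and value.
    Node::create([
      'type' => 'calculation_result',
      'title' => $item['label'],
      'field_result_value' => $item['value'],
    ])->save();
  }
  // Shown next to the progress bar while the batch runs.
  $context['message'] = t('Stored @count calculation results.', ['@count' => count($chunk)]);
}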
The remaining question was whether there would be any performance issues when reading that data. The calculations were used on a dashboard-like page, with several sections each using a large number of individual calculation results to display either a report or a graph. I anticipated that reading several thousand of those nodes, added to the time to build up the HTML for each dashboard section, would make the page unacceptably slow.
To my surprise, it worked fine. That is, the response time for the dashboard pages was well within acceptable limits. What made that possible was the Drupal cache system. It turns out that in Drupal 8 it is used very aggressively (i.e., whenever possible) and is efficient. The relevant gist for this application is that each of those calculations was stored in the cache as a single entity and in practice did not require the complex joins across all the individual field tables for display.
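Nothing special was needed in code to benefit from this; a dashboard section could simply query and load its nodes through the normal storage API and let the cache do its work. A rough sketch, with a hypothetical calculation_result content type and field name:

// Find the calculation nodes needed by one dashboard section.
$nids = \Drupal::entityQuery('node')
  ->condition('type', 'calculation_result')
  ->condition('field_report_section', $section_id)
  ->execute();

// Fully built entities come back from the entity cache when available,
// so repeated dashboard loads avoid the per-field table joins.
$calculations = \Drupal::entityTypeManager()
  ->getStorage('node')
  ->loadMultiple($nids);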
My conclusion from this project was that for most sites, even those involving large amounts of pure data, Drupal's regular content type nodes were probably adequate. Additionally, using this architecture did not require any additional coding or knowledge beyond what is already used on nearly every Drupal site.
Phase 3: Hybrid (partial Drupal entities?)
The next time I considered custom entities was for another site with massive quantities of data and calculations. In this case, various users of the site would upload data to the site throughout the month, totaling several hundred thousand items. The requirements almost demanded storage of denormalized summary data for use on reports and dashboards. The raw data entities would never be used other than for calculating the summary data.
However, the summary data would be recalculated after every upload to include the newly added information. The upload process stored the "raw" data from the file in the Drupal database and then launched the recalculation batch process, which in turn consisted of three steps each creating a different type of summary data from all the raw data records.
I created custom entities for both the raw data and the summary data. Even with the data summarized, the dashboard and reports would include thousands of pieces of data. The Views reports worked fine though, largely because pagination limited the amount of data dealt with for one page load.
However, the recalculation process took over an hour, which was not acceptable. Initially, it was using the full Drupal entity infrastructure for all reading and writing of these custom entities. The summarization was essentially a loop through all the hundreds of thousands of rows of raw data that wrote thousands of rows of summary data. The Drupal cache was not of great help here since the recalculation by definition included a vast amount of new data and wrote brand new summary data. So, this was a case where all the overhead of Drupal entities, as compared to bare database operations, was substantial and crippling.
My solution here, short of gutting the entire architecture, was to use raw database updates at key points of the upload process and for the entirety of the summary calculation. In the latter case, it turned out the entire operation could be boiled down to a single SQL statement using GROUP BY to derive the summary data from the raw data. The SQL was pretty complex - it took some 60 lines of PHP code to build up. But running it directly was as simple as:
\Drupal::database()->query($queryText);
The entire summarizing operation, deferring all the work to the database, took under 10 seconds!
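The real query was far more involved, but a stripped-down sketch of the pattern, with hypothetical table and column names, looks something like this:

// Rebuild the summary table from the raw data in a single statement,
// letting the database do the grouping and aggregation.
$queryText = "
  INSERT INTO summaries (account_id, period, item_count, total_amount)
  SELECT account_id, period, COUNT(*), SUM(amount)
  FROM raw_items
  GROUP BY account_id, period
";
\Drupal::database()->query($queryText);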
Caveats:
It should be noted that because these were custom entities with only base fields, the entities for both the raw and summary data were each stored in a single corresponding database table. This made the raw SQL relatively simple compared to fieldable entities, where each field gets its own table. As that complexity increases, the risk of causing integrity issues in the database when creating or updating increases rapidly.
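For reference, a base-field-only entity of this kind is declared roughly as follows; the entity type, table, and fields shown here are an illustrative sketch rather than the project's actual definitions:

namespace Drupal\my_module\Entity;

use Drupal\Core\Entity\ContentEntityBase;
use Drupal\Core\Entity\EntityTypeInterface;
use Drupal\Core\Field\BaseFieldDefinition;

/**
 * A minimal custom entity whose base fields all live in one table.
 *
 * @ContentEntityType(
 *   id = "summary_record",
 *   label = @Translation("Summary record"),
 *   base_table = "summaries",
 *   entity_keys = {
 *     "id" = "id",
 *   },
 * )
 */
class SummaryRecord extends ContentEntityBase {

  public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
    $fields = parent::baseFieldDefinitions($entity_type);

    // Because these are base fields (and the entity is not revisionable or
    // translatable), they are stored as columns of the single base_table.
    $fields['period'] = BaseFieldDefinition::create('string')
      ->setLabel(t('Period'));

    $fields['total_amount'] = BaseFieldDefinition::create('decimal')
      ->setLabel(t('Total amount'));

    return $fields;
  }

}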
It was also part of the design that the summary data was always considered a "current snapshot." Therefore, the summarizing process began by removing any existing summary data. This greatly mitigated the risk of the raw updates causing problems.
The Drupal cache system is still in play for custom entities. This must be taken into consideration when doing anything directly to the database. You can easily end up in a situation where Drupal uses a cached version of data because the Drupal cache system doesn't know anything about changes you make this way. Dealing with this was fairly simple because all the summary data was being deleted every time - a very coarse clearing of cache data was adequate. The two statements to wipe the slate clean looked similar to the following (with table and tag names specific to this case):
\Drupal::database()->query('TRUNCATE TABLE summaries');
\Drupal\Core\Cache\Cache::invalidateTags(['summary_data']);
Conclusions
For that last project, had I known I would end up with the raw SQL updating, I wouldn't have made Drupal entities at all. I would have just created the tables in the database directly, probably in an update hook in a module, and handled all the database operations directly. But what about using the data in Views, you say? Well, it turns out it's pretty easy to add a custom database data source for Views. This can make a table available to Views that Drupal otherwise knows nothing about. It can be done with hook_views_data() or with a class that implements EntityViewsDataInterface and is referenced from the entity type's annotation.
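As a rough illustration of the hook-based approach, exposing such a table looks something like the sketch below; the summaries table, its columns, and the labels are hypothetical:

/**
 * Implements hook_views_data().
 *
 * Exposes the otherwise unknown 'summaries' table to Views.
 */
function my_module_views_data() {
  $data['summaries']['table']['group'] = t('Summary data');
  $data['summaries']['table']['base'] = [
    'field' => 'id',
    'title' => t('Summary records'),
    'help' => t('Summary rows calculated from the raw uploads.'),
  ];

  // Expose one column as a Views field, sort, and filter.
  $data['summaries']['total_amount'] = [
    'title' => t('Total amount'),
    'help' => t('The summarized amount for this row.'),
    'field' => ['id' => 'numeric'],
    'sort' => ['id' => 'standard'],
    'filter' => ['id' => 'numeric'],
  ];

  return $data;
}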
I'm doubtful I will come to the conclusion that I need custom entities on future projects. Even with large amounts of data, the Drupal cache adequately addresses the performance concerns, at least when retrieving the data. For cases where the amount of data operations is so massive that Drupal's overhead is a blocking factor, I would strongly consider not using Drupal entities at all for that data and exposing it to Views directly.
It should be noted that my motivation in these cases, namely trying to mitigate database performance concerns, is not the only reason to use custom entities. Another factor, for example, is that a custom entity, even a fieldable one that is otherwise used just like built-in Drupal content entities, could be created without versioning, thereby eliminating all the overhead in Drupal and the database structure associated with that. This article was meant to focus simply on the database performance consideration.
Need a fresh perspective on a tough project?
Let’s talk about how RDG can help.