There are many systems that take a native data structure in your favorite language and, using some sort of reflection, makes an on-disk structure that resembles it. Python pickles and Java’s serialization system are infamous examples, and rkyv is a less alarming one.
I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.
Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.
(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
> Protobufs definitely doesn’t solve the problems described. Capnproto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t.
I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type.
Can you be a little more specific about your assertion?
> Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto
I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).
However, it is useful to point out that parquet/arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.
But if you use complicated serialisation formats you can't mmap a file into memory and use it directly. Which is quite convenient if you don't want to parse the whole file and allocate it to memory because it's too large compared to the amount of memory or time you have.
> Sometimes the best optimization is not a clever algorithm. Sometimes it is just changing the shape of the data.
This is basically Rob Pike's Rule 5: If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.(https://users.ece.utexas.edu/~adnan/pike.html)
I (deep, deep in embedded systems) have seen this too often, that code is incredibly complex and impossible to reason around because it needs to reach into some data structure multiple times from different angles to answer what should be rather simple questions about next step to take.
Fix that structure, and the code simplifies automagically.
I wouldn't give too much credit to rules like this. Data structures are often created with an approach in mind. You can't design a data structure without knowing how you will use it.
If anything it's the other way round, if you're not talking about business domain modeling (where data structures first is a valid approach).
To elaborate on @jeswin's point above (IDK why it got downvoted)... a data structure is basically like a cache for the processing algorithm. The business logic and algorithm needs will dictate what details can be computed on-the-fly -vs- pre-generated and stored (be it RAM or disk). Eg: if you're going to be searching a lot then it makes sense to augment the database with some kind of "index" for fast lookup. Or if you are repeatedly going to be pllotting some derived quantity then maybe it makes sense to derive that once and store with the struct.
It's not enough for a data structure to represent the "fundamental" degrees of freedom needed to model the situation; the algorithmic needs (vis-a-vis the available resources) most definitely matter a lot.
I'm saying that if you care about performance, data structures should be designed with approach specific tradeoffs in mind. And like I've said above, in typical business apps, it's ok to start with data structures because (a) performance is usually not a problem, (b) staying close to the domain is cleaner.
You said: "You can't design a data structure without knowing how you will use it."
But the whole discussion involves knowing how you will use it; the advocacy is for careful consideration of data structures (based on how you will use them) resulting in less pain when designing/choosing algorithms.
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
> But SQL schemas often look like this. Columns are nullable by default, and wide tables are common.
Hard disagree. That database table was a waving red flag. I don't know enough/any rust so don't really understand the rest of the article but I have never in my life worked with a database table that had 700 columns. Or even 100.
Some businesses are genuinely this complicated. Splitting those facts into additional tables isn't going to help very much unless it actually mirrors the shape of the business. If it doesn't align, you are forcing a lot of downstream joins for no good reason.
I saw tables with more than a thousand columns. It was a law firm home-grown FileMaker tool. Didn't inspect it too closely, so don't know what was inside
I remember a phrase from one of C. J. Date's books: every record is a logical statement. It really stood out for me and I keep returning to it. Such an understanding implies a rather small number of fields or the logical complexity will go through the roof.
The blog post is an entertaining read, but I was left with the impression the author might have tried do embellish, particularly in it's disbelief angle.
Take this passage:
> The app relied on a SOAP service, not to do any servicey things. No, the service was a pure function. It was the client that did all the side effects. In that client, I discovered a massive class hierarchy. 120 classes each with various methods, inheritance going 10 levels deep. The only problem? ALL THE METHODS WERE EMPTY. I do not exaggerate here. Not mostly empty. Empty.
> That one stumped me for a while. Eventually, I learned this was in service of building a structure he could then use reflection on. That reflection would let him create a pipe-delimited string (whose structure was completely database-driven, but entirely static) that he would send over a socket.
Classes with empty methods? Used reflection to create a pipe-delimited string? The string was sent over the wire?
Why congratulations, you just rediscovered data transfer objects, specifically API models.
If lots of columns are a red flag then red flags are quite common in many businesses. I’ve seen tables with tens of thousands of columns. Naturally those are not used by humans writing sql by hand, but there are many tools that have crazy data layouts and generate crazy sql to work with it.
As to your hard disagree, I guess it depends... While this particular user is on the higher end (in terms of columns), it's not our only user where column counts are huge. We see tables with 100+ columns on a fairly regular basis especially when dealing with larger enterprises.
Can you clarify which knowledge domains those enterprises fall under with examples of what problems they were trying to solve?
If it's not obvious, I agree with the hard disagree. Every time I see a table with that many columns, I have a hard time believing there isn't some normalization possible.
Schemas that stubbornly stick to high-level concepts and refuse to dig into the subfeatures of the data are often seen from inexperienced devs or dysfunctional/disorganized places too inflexible to care much. This isn't really negotiable. There will be issues with such a schema if it's meant to scale up or be migrated or maintained long term.
Normalization is possible but not practical in a lot of cases: nearly every “legacy” database I’ve seen has at least one table that just accumulates columns because that was the quickest way to ship something.
Also, normalization solves a problem that’s present in OLTP applications: OLAP/Big Data applications generally have problems that are solved by denormalization.
We have many large enterprises from wildly different domains use feldera and from what I can tell there is no correlation between the domain and the amount of columns.
As fiddlerwoaroof says, it seems to be more a function of how mature/big the company is and how much time it had to 'accumulate things' in their data model.
And there might be very good reasons to design things the way they did, it's very hard to question it without being a domain expert in their field, I wouldn't dare :).
> I can tell there is no correlation between the domain and the amount of columns.
This is unbelievable. In purely architectural terms that would require your database design to be an amorphous big ball of everything, with no discernible design or modelling involved. This is completely unrealistic. Are queries done at random?
In practical terms, your assertion is irrelevant. Look at the sparse columns. Figure out those with sparse rows.
Then move half of the columns to a new table and keep the other half in the original table. Congratulations, you just cut down your column count by half, and sped up your queries.
Even better: discover how your data is being used. Look at queries and check what fields are used in each case. Odds are, that's your table right there.
Let's face it. There is absolutely no technical or architectural reason to reach this point. This problem is really not about structs.
> Normalization is possible but not practical in a lot of cases: nearly every “legacy” database I’ve seen has at least one table that just accumulates columns because that was the quickest way to ship something.
Strong disagree. I'll explain.
Your argument would support the idea of adding a few columns to a table to get to a short time to market. That's ok.
Your comment does not come close to justify why you would keep the columns in. Not the slightest.
Tables with many columns create all sorts of problems and inefficiencies. Over fetching is a problem all on itself. Even the code gets brittle, where each and every single tweak risks beijg a major regression.
Creating a new table is not hard. Add a foreign key, add the columns, do a standard parallel write migration. Done. How on earth is this not practical?
There are sometimes reasons this is harder in practice, for example let’s say the business or even third parties have access to this db directly and have hundreds of separate apps/services relying on this db (also an anti-pattern of course but not uncommon), that makes changing the db significantly harder.
Mistakes made early on and not corrected can snowball and lead to this kind of mess, which is very hard to back out of.
I think you believe the average developer, especially on enterprise software where you see this sort of shit, is far more competent or ambitious than they actually are. Many would be horrified to see the number of monkeys banging out nasty DDL in Hibernate or whatever C# uses that have no idea what "normal forms" or "relational algebra" are and are actively resistant to even attempting to learn.
I’m working on migrating an IBM Maximo database from the late 90s to a SQL Server deployment on my current project. Also charged with updating the schema to a more maintainable and extensible design. Manufacturing and refurbishing domain - 200+ column tables is the norm. Very demoralizing.
Data from measurement tools. Everything about the tool configuration, time of measurement, operator ID, usually a bunch of electrical data (we make laser diodes) like current, potential, power, and a bunch of emission related data.
Sounds like a generic form of single table inheritance. I don't honestly see any other way to do it (punting to a JSON field is effectively the same thing) when you have potentially thousands of parts all with their own super specific relevant attributes.
I've worked on multiple products that have had a concept of "custom fields" who did it this way too.
> it sounds like helping customers with databases full of red flags is their bread and butter
Yes that captures it well. Feldera is an incremental query engine. Loosely speaking: it computes answers to any of your SQL queries by doing work proportional to the incoming changes for your data (rather than the entire state of your database tables).
If you have queries that take hours to compute in a traditional database like Spark/PostgreSQL/Snowflake (because of their complexity, or data size) and you want to always have the most up-to-date answer for your queries, feldera will give you that answer 'instantly' whenever your data changes (after you've back-filled your existing dataset into it).
771 columns (and I've read the definitions for them all, plus about 50 more that have been retired). In the database, these are split across at least 3 tables (registry, patient, tumor). But when working with the records, it's common to use one joined table. Luckily, even that usually fits in RAM.
Not everyone understands normal form, much less 3rd normal form. I’ve seen people do worse with excel files where they ran out of columns and had to link across spreadsheets.
It's OLAP, it very common for analytical tables to be denormalized. As an example, each UserAction row can include every field from Device and User to maximize the speed at which fraud detection works. You might even want to store multiple Devices in a single row: current, common 1, 2 and 3.
It is very common to find tables with 1000+ columns in machine learning training sets at e-commerce companies. The largest I have seen had over 10000 columns.
That statement jumped out at me as well. I've worked as a DBA on tons of databases backing a wide variety of ERPs, web apps, analytics, data warehouses...700 columns?!? No.
You've never seen an SAP database where the business object had a couple hundred fields? Its pretty much required if you're touching international data.
I have seen tables (SQL and parquet, too) that have at least high hundreds of optional columns, but this was always understood to be a terrible hack, in those cases.
> Hard disagree. That database table was a waving red flag.
Exactly this.
This article is not about structs or Rust. This article is about poor design of the whole persistence layer. I mean, hundreds of columns? Almost all of them optional? This is the kind of design that gets candidates to junior engineer positions kicked off a hiring round.
Nobody gets fired for using a struct? If it's an organization that tolerates database tables with nearly 1k optional rows then that comes at no surprise.
Here is an article I wrote this week with a section on Feldera - how it uses its incremental compute engine to compute "rolling aggregates" (the most important real-time feature for detecting changes in user behavior/pricing/anamalies).
Strictly speaking, Isn't there still a way to express at least one Illegal string in ArchivedString? I'm not sure how to hint to the Rust compiler which values are illegal, but if the inline length (at most 15 characers) is aliased to the pointer string length (assume little-endian), wouldnt {ptr: null, len: 16} and {inline_data: {0...}, len: 16} both technically be an illegal value?
I'm not saying this is better than your solution, just curious :^)
Why is rust allowed to reorder fields? If I know that fields are going to be generally accessed together, this prevents me from ordering them so they fit in cache lines.
I feel like I'm missing something, but the article started by talking about SQL tables, and then in-memory representations, and then on-disk representation, but...isn't storing it on a disk already what a SQL database is doing? It sounds like data is being read from a disk into memory in one format and then written back to a disk (maybe a different one?) in another format, and the second format was not as efficient as the first. I'm not sure I understand why a third format was even introduced in the first place.
Feldera is an incremental query engine, you can think of it as a specialized database. If you have a set of question you can express in SQL it will ingest all your data and build many sophisticated indexes for it (these get stored on disk). Whenever new data arrives feldera can instantly update the answers to all your questions. This is mostly useful when the data is much larger than what fits in memory because then the questions will be especially expensive to answer with a regular (batch) database.
I would assume because then the shape of the data would be too different? SOAs is super effective when it suits the shape of the data. Here, the difference would be the difference between an OLTP and OLAP DB. And you wouldn't use an OLAP for an OLTP workload?
I wasn't sure about writing the article in the first place because of that, but I figured it may be interesting anyways because I was kind of happy with how simple it was to write this optimization when it was all done (when I started out with the task I wasn't sure if it would be hard because of how our code is structured, the libraries we use etc.). I originally posted this in the rust community, and it seems people enjoyed the post.
Just cus structs and classes work differently, and classes are much more common. I tend to make everything a class, unless there is a really good reason to make it a struct.
I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.
Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.
(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
It’s like you listed a bunch of serialization technologies without grokking the problem outlined in the post doesn’t have much to do with rkyv itself.
I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type.
Can you be a little more specific about your assertion?
I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).
However, it is useful to point out that parquet/arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.
BS. Nothing can be faster than a read()/write() (or even mmap()) into a struct, because everything else would need to do more work.
This is basically Rob Pike's Rule 5: If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.(https://users.ece.utexas.edu/~adnan/pike.html)
I (deep, deep in embedded systems) have seen this too often, that code is incredibly complex and impossible to reason around because it needs to reach into some data structure multiple times from different angles to answer what should be rather simple questions about next step to take.
Fix that structure, and the code simplifies automagically.
If anything it's the other way round, if you're not talking about business domain modeling (where data structures first is a valid approach).
It's not enough for a data structure to represent the "fundamental" degrees of freedom needed to model the situation; the algorithmic needs (vis-a-vis the available resources) most definitely matter a lot.
I'm saying that if you care about performance, data structures should be designed with approach specific tradeoffs in mind. And like I've said above, in typical business apps, it's ok to start with data structures because (a) performance is usually not a problem, (b) staying close to the domain is cleaner.
But the whole discussion involves knowing how you will use it; the advocacy is for careful consideration of data structures (based on how you will use them) resulting in less pain when designing/choosing algorithms.
> If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.
This is what I was responding to.
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
https://en.wikiquote.org/wiki/Fred_Brooks
Hard disagree. That database table was a waving red flag. I don't know enough/any rust so don't really understand the rest of the article but I have never in my life worked with a database table that had 700 columns. Or even 100.
But OLAP tables (data lake/warehouse stuff), for speed purposes, are intentionally denormalized and yes, you can have 100+ columns of nullable stuff.
I remember a phrase from one of C. J. Date's books: every record is a logical statement. It really stood out for me and I keep returning to it. Such an understanding implies a rather small number of fields or the logical complexity will go through the roof.
I think this company was ahead of the curve.
Take this passage:
> The app relied on a SOAP service, not to do any servicey things. No, the service was a pure function. It was the client that did all the side effects. In that client, I discovered a massive class hierarchy. 120 classes each with various methods, inheritance going 10 levels deep. The only problem? ALL THE METHODS WERE EMPTY. I do not exaggerate here. Not mostly empty. Empty.
> That one stumped me for a while. Eventually, I learned this was in service of building a structure he could then use reflection on. That reflection would let him create a pipe-delimited string (whose structure was completely database-driven, but entirely static) that he would send over a socket.
Classes with empty methods? Used reflection to create a pipe-delimited string? The string was sent over the wire?
Why congratulations, you just rediscovered data transfer objects, specifically API models.
As to your hard disagree, I guess it depends... While this particular user is on the higher end (in terms of columns), it's not our only user where column counts are huge. We see tables with 100+ columns on a fairly regular basis especially when dealing with larger enterprises.
If it's not obvious, I agree with the hard disagree. Every time I see a table with that many columns, I have a hard time believing there isn't some normalization possible.
Schemas that stubbornly stick to high-level concepts and refuse to dig into the subfeatures of the data are often seen from inexperienced devs or dysfunctional/disorganized places too inflexible to care much. This isn't really negotiable. There will be issues with such a schema if it's meant to scale up or be migrated or maintained long term.
Also, normalization solves a problem that’s present in OLTP applications: OLAP/Big Data applications generally have problems that are solved by denormalization.
We have many large enterprises from wildly different domains use feldera and from what I can tell there is no correlation between the domain and the amount of columns. As fiddlerwoaroof says, it seems to be more a function of how mature/big the company is and how much time it had to 'accumulate things' in their data model. And there might be very good reasons to design things the way they did, it's very hard to question it without being a domain expert in their field, I wouldn't dare :).
This is unbelievable. In purely architectural terms that would require your database design to be an amorphous big ball of everything, with no discernible design or modelling involved. This is completely unrealistic. Are queries done at random?
In practical terms, your assertion is irrelevant. Look at the sparse columns. Figure out those with sparse rows. Then move half of the columns to a new table and keep the other half in the original table. Congratulations, you just cut down your column count by half, and sped up your queries.
Even better: discover how your data is being used. Look at queries and check what fields are used in each case. Odds are, that's your table right there.
Let's face it. There is absolutely no technical or architectural reason to reach this point. This problem is really not about structs.
Strong disagree. I'll explain.
Your argument would support the idea of adding a few columns to a table to get to a short time to market. That's ok.
Your comment does not come close to justify why you would keep the columns in. Not the slightest.
Tables with many columns create all sorts of problems and inefficiencies. Over fetching is a problem all on itself. Even the code gets brittle, where each and every single tweak risks beijg a major regression.
Creating a new table is not hard. Add a foreign key, add the columns, do a standard parallel write migration. Done. How on earth is this not practical?
Mistakes made early on and not corrected can snowball and lead to this kind of mess, which is very hard to back out of.
I've worked on multiple products that have had a concept of "custom fields" who did it this way too.
So it sounds like helping customers with databases full of red flags is their bread and butter
Yes that captures it well. Feldera is an incremental query engine. Loosely speaking: it computes answers to any of your SQL queries by doing work proportional to the incoming changes for your data (rather than the entire state of your database tables).
If you have queries that take hours to compute in a traditional database like Spark/PostgreSQL/Snowflake (because of their complexity, or data size) and you want to always have the most up-to-date answer for your queries, feldera will give you that answer 'instantly' whenever your data changes (after you've back-filled your existing dataset into it).
There is some more information about how it works under the hood here: https://docs.feldera.com/literature/papers
771 columns (and I've read the definitions for them all, plus about 50 more that have been retired). In the database, these are split across at least 3 tables (registry, patient, tumor). But when working with the records, it's common to use one joined table. Luckily, even that usually fits in RAM.
100s is not unusual. Thousands happens before you realise.
Exactly this.
This article is not about structs or Rust. This article is about poor design of the whole persistence layer. I mean, hundreds of columns? Almost all of them optional? This is the kind of design that gets candidates to junior engineer positions kicked off a hiring round.
Nobody gets fired for using a struct? If it's an organization that tolerates database tables with nearly 1k optional rows then that comes at no surprise.
They don’t have the option to clean up the data.
https://www.hopsworks.ai/post/rolling-aggregations-for-real-...
I'm not saying this is better than your solution, just curious :^)
There may be good reasons (I don't know any) why it wasn't done like this, but from a high-level it looks possible to me too yes.
Feel free to try it out, it's open source: https://github.com/feldera/feldera/
https://en.wikipedia.org/wiki/Data-oriented_design
I would assume because then the shape of the data would be too different? SOAs is super effective when it suits the shape of the data. Here, the difference would be the difference between an OLTP and OLAP DB. And you wouldn't use an OLAP for an OLTP workload?
https://www.postgresql.org/docs/current/storage-page-layout....
I wasn't sure about writing the article in the first place because of that, but I figured it may be interesting anyways because I was kind of happy with how simple it was to write this optimization when it was all done (when I started out with the task I wasn't sure if it would be hard because of how our code is structured, the libraries we use etc.). I originally posted this in the rust community, and it seems people enjoyed the post.