It’s frustrating that nearly every system requires a data import before the data can be useful. Too much data lives outside databases, so it cannot be queried where it is; it has to be duplicated to become queryable. For example, if you store data in S3, you generally have to download it or load it into another system before you can query it. Or, if you use MySQL, you have to copy your data into Rockset to get fast analytic queries.
Ideally, we should be running queries where the data lives. You cannot run arbitrary queries directly against S3. TiDB has a feature that pushes computation down to the storage layer, TiKV (a key-value store), so the work happens closer to where the data lives. This is called operator pushdown.
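To make the pushdown idea concrete, here is a toy sketch in Python. It is not TiDB's or TiKV's actual API: `StorageNode`, `scan`, `scan_with_filter`, and the `rows_sent` counter are all hypothetical names invented for illustration. The only point is to show how many rows cross the "network" when the filter runs at the storage layer rather than at the query layer.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class StorageNode:
    """Hypothetical stand-in for a storage layer holding key-value rows."""

    def __init__(self, rows: Dict[int, Row]) -> None:
        self.rows = dict(rows)
        self.rows_sent = 0  # rows shipped back to the query layer

    def scan(self) -> List[Row]:
        # No pushdown: every row is sent to the caller.
        out = list(self.rows.values())
        self.rows_sent += len(out)
        return out

    def scan_with_filter(self, predicate: Predicate) -> List[Row]:
        # Pushdown: the filter runs next to the data, only matches are sent.
        out = [row for row in self.rows.values() if predicate(row)]
        self.rows_sent += len(out)
        return out


def query_without_pushdown(node: StorageNode, predicate: Predicate) -> List[Row]:
    # The query layer pulls everything, then filters locally.
    return [row for row in node.scan() if predicate(row)]


def query_with_pushdown(node: StorageNode, predicate: Predicate) -> List[Row]:
    # The query layer hands the filter to the storage node.
    return node.scan_with_filter(predicate)
```

Both functions return the same rows. The difference is how much data has to move to produce them, which is the cost that running queries where the data lives is meant to avoid.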
Data sits outside databases, trapped in binary formats. It would be nice to query my file system for files that were modified between two dates and get results back immediately. This is where my idea to index everything comes from. Storage is cheap, so we can index everything remotely and get the best of both worlds: the data stays where it lives, and it is still fast to query.
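As a rough sketch of what "index everything" could look like for the file system case, the Python below walks a directory, records each file's path, size, and modification time in SQLite, and then answers a "modified between two dates" query from the index instead of rescanning the disk. The function names `build_index` and `modified_between` and the table layout are my own assumptions for illustration, and a real version would also need to keep the index up to date as files change.

```python
import os
import sqlite3
from datetime import datetime


def build_index(root: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Walk `root` and record each file's path, size, and mtime in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime REAL)"
    )
    # The index on mtime is what keeps date-range lookups cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS files_mtime ON files (mtime)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (path, st.st_size, st.st_mtime),
            )
    conn.commit()
    return conn


def modified_between(conn: sqlite3.Connection, start: datetime, end: datetime) -> list:
    """Return paths with start <= mtime < end, oldest first."""
    rows = conn.execute(
        "SELECT path FROM files WHERE mtime >= ? AND mtime < ? ORDER BY mtime",
        (start.timestamp(), end.timestamp()),
    )
    return [row[0] for row in rows]
```

Pass timezone-aware `datetime` values (for example `datetime(2020, 1, 1, tzinfo=timezone.utc)`) so the range does not depend on the machine's local time zone. The files themselves are never copied; only their metadata is duplicated into the index.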