I entered the Azure Data Explorer (ADX) world recently after primarily working with Azure Tables, SQL Databases, and Cosmos DBs over the past few years.
When I moved onto ADX, I approached it like I was working with the databases I’ve become so familiar with. ADX is not like those databases, and it required that I adjust my thought process around working with it.
Querying data in ADX was straightforward enough, but getting data into ADX was not. While there are methods of synchronously inserting handfuls of records at a time, these methods are not intended for production use. If you plan on using ADX for individual or small create, update, and delete transactions, you’re probably better off looking elsewhere.
Quickly, while we’re on the subject of updating and deleting: for the most part, this is not supported in ADX. Tables and databases support retention policies, which specify how long data is kept after it has been ingested. If you know you only need the data for a set amount of time before it is cleaned up, then this is natively supported. There are very inefficient methods of deleting a single record, but these are intended for GDPR use cases and should be treated as a last resort. There are also methods of deleting whole blocks of data, but you don’t need to worry about these until you get into advanced use of ADX.
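As a rough illustration of the retention policy mentioned above, here is a minimal Python sketch that only builds the control command text; it does not talk to a cluster. The table name "Telemetry" and the 30-day period are assumptions made up for the example.

```python
def retention_policy_command(table: str, soft_delete_days: int) -> str:
    """Build a control command that sets how long a table keeps ingested data."""
    if soft_delete_days < 0:
        raise ValueError("soft_delete_days must not be negative")
    # softdelete is the period, counted from ingestion, that data stays queryable.
    return (
        f".alter-merge table {table} policy retention "
        f"softdelete = {soft_delete_days}d"
    )


# "Telemetry" is a hypothetical table name.
print(retention_policy_command("Telemetry", 30))
```

You would run the resulting command against your database the same way you run any other management command.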
Anyways, moving on to what this is really about: getting your data into ADX.
When thinking about ingestion, you first have to decide between two models. With direct ingestion, you tell ADX when to ingest your data. With managed ingestion, you tell ADX where your data is and let it decide when the best time to ingest is. For most use cases, if your workload supports the managed model, it is the most efficient and recommended way of getting your data into ADX. The managed model lets ADX handle the distribution, rate of ingestion, and clean-up of ingested data. If you opt for the direct model, you will have to orchestrate your ingestion yourself to avoid overloading the cluster with ingestion requests, but you will have more control over every aspect of ingestion.
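To make the orchestration burden of the direct model a bit more concrete, here is a minimal sketch of one way to cap how many ingestion commands are in flight at once. It is not an ADX API: `execute` is a hypothetical callable standing in for whatever client call actually sends a command to your cluster, and the limit of 2 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_throttled(
    commands: Iterable[str],
    execute: Callable[[str], object],
    max_concurrent: int = 2,
) -> List[object]:
    """Run ingestion commands with at most max_concurrent in flight at a time."""
    # The pool size is the throttle: extra commands wait until a worker is free.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() returns results in the same order as the input commands.
        return list(pool.map(execute, commands))
```

With the managed model, this kind of throttling is the cluster's job rather than yours.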
Managed and direct are just the models, which are implemented by a wide range of SDKs. The following are the most common methods of ingestion:
Inline ingestion (Direct Ingestion)
Specify the data to be ingested (inserted) right in the query itself. This method is not intended for production use and is commonly used in the development phase or for small one-time ingestions.
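Here is a minimal sketch of what an inline ingestion command looks like, built as a string in Python. The table name "Events" and the sample rows are assumptions for illustration, and the helper only produces the command text; running it is up to whatever client you use.

```python
import csv
import io
from typing import Iterable, Sequence


def inline_ingest_command(table: str, rows: Iterable[Sequence[object]]) -> str:
    """Build an inline ingestion command with the rows written as CSV."""
    buf = io.StringIO()
    # csv.writer quotes any field that contains a comma or a quote character.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    data = buf.getvalue().rstrip("\n")
    return f".ingest inline into table {table} <|\n{data}"


# "Events" and its rows are hypothetical sample data.
print(inline_ingest_command("Events", [("2024-01-01", "login", 42),
                                       ("2024-01-02", "logout", 7)]))
```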
Ingest from query (Direct Ingestion)
Write a Data Explorer query and ingest the results of that query into another table. This can be used across databases and clusters, and is typically used for generating reports and storing them in another table. For best performance, result sets should be less than 1 GB; if you need to work with larger sets, batch them into multiple smaller sets.
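One way to batch a large ingest-from-query is to split it by time window, so each command only moves a slice of the source. The sketch below builds one `.set-or-append` command per window. The table names, the `Timestamp` column, and the window size are all assumptions for the example; how wide a window can be while staying under the size guideline depends entirely on your data.

```python
from datetime import date, timedelta
from typing import List


def batched_append_commands(
    target: str,
    source_query: str,
    time_column: str,
    start: date,
    end: date,
    step_days: int = 1,
) -> List[str]:
    """Build one ingest-from-query command per time window in [start, end)."""
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    commands = []
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past `end`.
        upper = min(cursor + timedelta(days=step_days), end)
        commands.append(
            f".set-or-append {target} <| {source_query} "
            f"| where {time_column} >= datetime({cursor.isoformat()}) "
            f"and {time_column} < datetime({upper.isoformat()})"
        )
        cursor = upper
    return commands


# "DailyReport", "Events", and "Timestamp" are hypothetical names.
for command in batched_append_commands(
    "DailyReport", "Events", "Timestamp", date(2024, 1, 1), date(2024, 1, 4)
):
    print(command)
```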
Ingest from storage (Direct Ingestion)
Tell Data Explorer where the data to be ingested is (a blob URI, for example) and it will pull the data in and ingest it.
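As a sketch, the command for ingesting from storage can be built like this. The table name and blob URI are placeholders, not real resources, and the CSV format is an assumption. The `h` prefix on the string literal marks it as obfuscated, which keeps any secret in the URI (such as a SAS token) out of logs.

```python
def ingest_from_blob_command(table: str, blob_uri: str, data_format: str = "csv") -> str:
    """Build a direct ingestion command that pulls data from a storage URI."""
    if "'" in blob_uri:
        # A single quote would end the string literal early.
        raise ValueError("blob_uri must not contain a single quote")
    return f".ingest into table {table} (h'{blob_uri}') with (format='{data_format}')"


# Placeholder URI: substitute your own account, container, file, and SAS token.
print(ingest_from_blob_command(
    "Events",
    "https://<account>.blob.core.windows.net/<container>/<file>.csv?<sas-token>",
))
```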
Queued ingestion (Managed Ingestion)
This is similar to ingest from storage; however, with this method you queue up a request with the location of the data to be ingested. This allows the cluster to determine the best time to ingest. Once it has capacity, it pulls an item from the queue to be ingested. This is the most efficient method for ingesting large quantities of data, but it can result in longer latency between when the ingestion request is sent and when the data is ready for querying.
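The trade-off is easier to see in a toy model. The sketch below is not the ADX SDK and makes no claim about how the service is implemented; it is a made-up, in-memory stand-in where `enqueue` returns immediately and a `tick` represents the cluster pulling requests whenever it has capacity. The class name, the capacity of 2, and the blob names are all assumptions.

```python
from collections import deque
from typing import Deque, List


class ToyIngestionQueue:
    """In-memory stand-in for a queue of pending ingestion requests."""

    def __init__(self, capacity_per_tick: int) -> None:
        self._pending: Deque[str] = deque()
        self._capacity = capacity_per_tick
        self.ingested: List[str] = []

    def enqueue(self, blob_uri: str) -> None:
        # The caller is done as soon as the request is queued.
        self._pending.append(blob_uri)

    def tick(self) -> int:
        """Ingest as many queued requests as capacity allows; return how many remain."""
        for _ in range(min(self._capacity, len(self._pending))):
            self.ingested.append(self._pending.popleft())
        return len(self._pending)


queue = ToyIngestionQueue(capacity_per_tick=2)
for name in ["blob-1", "blob-2", "blob-3", "blob-4", "blob-5"]:
    queue.enqueue(name)

while queue.tick() > 0:
    pass  # later requests wait in the queue until capacity frees up
```

The last blob is queued right away but only becomes "queryable" on the third tick, which is the latency cost described above.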
For more info check out the Azure doc on this subject: ADX Data Ingestion
Confused? Need more info? Leave a comment and I’ll try to get you sorted out.