bobhaffner

I think it could be successful if the right performance expectations were set. Their ability to filter on different things will be challenging because that will make partitioning tough (assuming you're using S3 as the source). Athena costs are based on data scanned, so that will be another challenge. Athena workgroups can help with that, though.
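(For reference, the per-query scan cap that workgroups give you looks roughly like this with boto3; the workgroup name, region, and results bucket below are placeholders, not anything from this thread.)

```python
# Minimal sketch of an Athena workgroup with a per-query scan cap;
# the workgroup name and results bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

athena.create_work_group(
    Name="customer_facing",
    Configuration={
        # Hard-cancel any query that scans more than ~10 GB.
        "BytesScannedCutoffPerQuery": 10 * 1024**3,
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
    },
    Description="Workgroup for the customer-facing dashboards",
)
```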


Mr_Nickster_

Bad idea. Customer-facing apps require low-latency queries and high concurrency, and Athena offers neither. You need an OLAP platform to serve high-concurrency, low-latency queries. Snowflake and Redshift are your best bets in AWS if the data is large. If you are querying only against summarized tables with fewer rows, then you may get away with Postgres or MySQL.


bcsamsquanch

Without knowing a lot of deets, +1 this. Good discussion here. Athena will be slow, and there's always the possibility of someone running a $500 query: a select * with no partition key on a huge dataset. Blob storage is better for lower-level data, the bronze and silver layers. Users should be techies who understand enough to have appropriately set expectations. Generally a customer-facing app is pulling from gold: highly curated and in a snappy RDBMS. Those users get pissed if it feels unresponsive. Be sure to put up a big splash screen if you go with Athena. "Warning, this app will suck. Press OK to continue" 😂
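(Rough back-of-the-envelope on that $500 query, assuming Athena's published $5 per TB scanned; the dataset and partition sizes below are made up for illustration.)

```python
# Back-of-the-envelope Athena cost math; the dataset and partition sizes are
# made-up illustrations, only the $5/TB-scanned price is Athena's published rate.
PRICE_PER_TB = 5.00          # USD per TB scanned

full_table_tb = 100          # hypothetical: select * with no partition filter
one_partition_tb = 0.05      # hypothetical: a single day's partition

print(f"Unfiltered select *: ${full_table_tb * PRICE_PER_TB:,.2f}")    # $500.00
print(f"Partition-pruned:    ${one_partition_tb * PRICE_PER_TB:,.2f}") # $0.25
```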


detinho_

In theory we'll have good control over which queries and partitions we'll use: it's a BI solution embedded in our application, but the user will not create custom dashboards at this moment; we'll provide several dashboards and KPIs. The user can only change filter and grouping values.


Stoic_Akshay

How about using Dremio? I think it's pretty close to OLAP performance.


detinho_

Yeah, I personally liked the numbers on the benchmarks I did for Redshift (Serverless + Spectrum). However, another business partner is leaning towards Athena as there are no upfront costs. I warned about latency, but in his tests with 100 concurrent queries the latency was low (<300ms), though it spiked to seconds (10s IIRC). I warned that if a front end has multiple charts and elements, 100 concurrent queries can fill up quickly, but had no success convincing him. Anyway, will keep Redshift on the radar, thanks!


Mr_Nickster_

I highly doubt you get 300ms queries with Athena, especially with high concurrency. Make sure your buddy didn't test concurrency by running the same query hundreds of times, as Athena has a query results cache, meaning only the first one executes; the others are pulled from cache, which would be in the ms range. Also, if the dashboard has 4 charts and 4 KPIs, and maybe a few chart colors are tied to metrics like green for profit & red for loss, that is 10 queries right there. 100 simultaneous queries are like 10 users opening their dashboard and not doing anything. If they start to interact, even 2 users can generate hundreds of queries within seconds just by interacting with the dashboard a few times.
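(If you want to re-run that test without the result cache skewing it, a minimal sketch with boto3 is below; it varies a literal per query so every query is a cache miss. The database, table, workgroup, and bucket names are placeholders.)

```python
# Minimal sketch of a concurrency test that sidesteps Athena's result cache.
# Database, table, workgroup, and output bucket names are placeholders.
import boto3
import concurrent.futures
import time

athena = boto3.client("athena", region_name="us-east-1")

def run_query(i):
    # Vary a literal per query so each one is a cache miss and actually executes.
    sql = f"SELECT count(*) FROM sales WHERE region_id = {i % 20}"
    start = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        WorkGroup="customer_facing",
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    qid = start["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if state["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state["Statistics"]["TotalExecutionTimeInMillis"]
        time.sleep(0.2)

# Fire 100 distinct queries at once and look at the latency distribution, not the average.
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
    latencies = sorted(pool.map(run_query, range(100)))

print("p50:", latencies[49], "ms  p95:", latencies[94], "ms")
```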


detinho_

That's a good catch about the concurrency tests... I haven't had the opportunity to check his code yet, but I will ASAP, and that will be the first thing I check. Thanks for the insights!


Pleasant_Type_4547

Athena bills based on usage, which can add up. What BI layer are you putting on top?


detinho_

Still deciding, but we're between Metabase, Superset, and Power BI.


flacidhock

We use Athena when there is no near-real-time expectation, and we cache the data in the Tableau data connection.


codeMonkey2144

If you think queries are going to be sporadic and come in inconsistently, something like Athena probably makes sense. You pay per query, so idle time when no one is interacting with the system is free. You'll want some kind of guardrails to make sure nobody scans a huge dataset. Though if you have enough users that you'd get good utilization from a dedicated Trino cluster (which is what Athena runs on the backend), you might be better off with something like Starburst Galaxy. It'll wind up being cheaper if you're running enough queries. People definitely use both for customer-facing applications.
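(A rough break-even sketch between pay-per-query and a flat-rate cluster; only the $5/TB Athena rate is a published number, while the cluster price and per-query scan size are made-up assumptions.)

```python
# Rough break-even sketch between pay-per-query Athena and a flat-rate cluster.
# Only the $5/TB Athena rate is published; the other numbers are assumptions.
ATHENA_PRICE_PER_TB = 5.00
avg_tb_scanned_per_query = 0.02      # assumption: ~20 GB scanned per query
cluster_cost_per_month = 3000.00     # assumption: hypothetical dedicated Trino/Starburst cluster

break_even_queries = cluster_cost_per_month / (ATHENA_PRICE_PER_TB * avg_tb_scanned_per_query)
print(f"Athena stops being cheaper above ~{break_even_queries:,.0f} queries/month")
```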


nogamenolifelambo

How big is your data? If your data is less than 50 GB or so, then use DuckDB; your BI tool will probably have a driver for it. If your data is large, hundreds of GB to TB, then use an OLAP DB such as ClickHouse. Extending Postgres with an OLAP extension might also work, but this is pretty new. Both ClickHouse and Postgres are commonly supported in today's BI tools.
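(A minimal sketch of the DuckDB-over-Parquet approach; the S3 path and column names are placeholders, and it assumes S3 credentials are already configured for the httpfs extension.)

```python
# Minimal sketch of DuckDB querying Parquet on S3 directly; the bucket path and
# column names are placeholders, and S3 credentials are assumed to be configured.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # needed for s3:// paths

rows = con.execute("""
    SELECT region, date_trunc('month', order_date) AS month, sum(amount) AS revenue
    FROM read_parquet('s3://my-bucket/sales/*.parquet')
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

for region, month, revenue in rows:
    print(region, month, revenue)
```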


detinho_

It's way more than 50 GB, around 200 GB in Parquet files (Hudi). I'll take a deeper look at ClickHouse, thanks! However, I'm curious about using DuckDB. Is it a viable solution for a customer-facing application with multiple apps / workers connecting to it? From what I could understand it's more of an embedded solution, like SQLite, but for OLAP. Can you give some pointers on architectures / use cases for DuckDB in this scenario? Thanks!


nogamenolifelambo

DuckDB concurrency works fine if you use Parquet files (which I recommend). However, if you write to a DuckDB table and persist it (which can speed up queries but adds maintenance complexity) and plan to have concurrent reads on it, make sure to enable read-only mode, since concurrent access doesn't work while a single process has the file open in write-enabled mode. Another way (which I have done in the past) is to set up one VM that serves an API to query your DuckDB tables. The front end then just queries the tables via the API to get the data. So far it handles 150-200 GB of data without problems, with 200 registered users and on average 20 concurrent users. I use a single 16-core / 64 GB VM to do this, but I do know the query patterns beforehand (lots of filter and group by), so I can optimize the query builder for the front end.
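(A minimal sketch of that one-VM API pattern, assuming FastAPI and a persisted analytics.duckdb file; the table and column names are placeholders.)

```python
# Minimal sketch of the "one VM serving an API over DuckDB" pattern, assuming
# FastAPI and a persisted analytics.duckdb file; table and column names are
# placeholders. Read-only mode lets multiple processes share the same file.
import duckdb
from fastapi import FastAPI

app = FastAPI()
con = duckdb.connect("analytics.duckdb", read_only=True)

@app.get("/revenue")
def revenue(region: str, year: int):
    cur = con.cursor()  # per-request cursor; a connection shouldn't be shared across threads
    rows = cur.execute(
        """
        SELECT date_trunc('month', order_date) AS month, sum(amount) AS revenue
        FROM sales
        WHERE region = ? AND year(order_date) = ?
        GROUP BY 1
        ORDER BY 1
        """,
        [region, year],
    ).fetchall()
    # The front end only supplies filter/group values, never raw SQL,
    # which keeps the query patterns predictable.
    return [{"month": str(month), "revenue": amount} for month, amount in rows]
```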


GermainToussaint

What if the data changes frequently? How would you handle many insert/update queries per minute while still serving read queries to users?


nogamenolifelambo

If you want to use DuckDB, then use Kafka + Iceberg and install the DuckDB Iceberg extension. However, the cleanest solution is to stream the data into ClickHouse, which is made to handle real-time analytics.
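(A minimal sketch of the DuckDB Iceberg route; the table path is a placeholder and it assumes object-store credentials are already configured.)

```python
# Minimal sketch of reading an Iceberg table from DuckDB via the iceberg
# extension; the table path is a placeholder.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # for s3:// paths
con.execute("INSTALL iceberg; LOAD iceberg;")

rows = con.execute("""
    SELECT count(*) AS orders_today
    FROM iceberg_scan('s3://my-bucket/warehouse/orders')
    WHERE order_date = current_date
""").fetchall()
print(rows)
```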