JJ

@JJ

Just some thoughts, and some works in progress.

3,334 words

Guestbook
You'll only receive email when JJ publishes a new post

Resources exceeded during query execution

“Resources exceeded during query execution”, an article on the Memory limit problem faced in BQ.
In some cases you will get an “Resources exceeded during query execution” error when trying to build a large derived table in BQ(Big Query). This generally happen for queries that require huge amount of memory to run, or when a inner/derived query needs huge amount of memory to store temporarily while more operations is made on it.

Why does this happen?

This happens because even though BQ is a distributed Relational database and most queries are executed on multiple nodes, there are certain operations that requires all data to be processed on a single node due to calculation needs, which means that all the data needed for the calculation needs to exist on the same node. This is needed for those calculations even though BQ is a "distributed DB" and since BQ is a database running with a "serverless paradigm", it has several limitations for single node computation requests.
When your query requests for too much data to be processed on a single node and can no longer fit onto that node, you will get the “Resources exceeded during query execution” error and the whole query fails.
In the case of failure, even if your query already processed huge amount of data or took a really long time to run, your account will not be billed by Google for this query. Since all failed queries are not billed (Refer to BQ billing).


What are the Operations that Operate on a Single Node?

Operations that need to see all the data in the resulting table at once have to operate on a single node.
Un-partitioned window functions like RANK() OVER() or ROW_NUMBER() OVER() will operate on a single node.
Another operation that is mentioned to face a lot of problems is the ORDER BY clause. It is an operation that requires a single node to see all data, so it may hit the memory limit of a single.
ORDER BY clause:

  • If your query contains an ORDER BY clause, all the data is still going to be passed to a single node and then sorted.
  • If it doesn’t fit on the node then it fails. Since BQ is a scanning database (it scans the whole column for each query) sorting the data is unnecessary.
  • So don’t use ORDER BY in ordering derived table results.

When will it matter?

When you want to generate a primary key for a derived table. A common way to do this is to use ROW_NUMBER() OVER() as id to generate a sequence number for a each row in your table. If the table is big we fail with “Resources exceeded” because this function tries to operate on a single node.
Other situations is when ordering table that requires all data to be present in order to do proper sorting.
Or in fact any other tasks that is not based on grouping data, or tasks that cannot be easily split up to work in parallel will have a chance of facing this type of issue.


How to avoid operating on a single node, to prevent query from failing?

Unfortunately, there seems to be no way to increase the per node memory limit of BQ and it is not likely that Google would do something about it soon. At least at the time of writing, there seems to be no known way of customizing the node BQ uses for its calculation too. Thus in this case, it is better if we choose to work with this limitation in mind and design solutions that does not rely heavily on operations that will overload single node processes. Below are some ways to deal with the issue:

Partitioning using a generated key

Luckily, there’s an easy way around this. You can PARTITION the the window function by date and build a string as the primary key. For example:

CONCAT(CAST(ROWNUMBER() OVER(PARTITION BY eventdate) AS STRING),
'|',(CAST(event_date AS STRING)) as id

Splitting up the query and using "temporary tables"

You can try to split your query up.
Write the results of every individual sub/inner query to another table as a temporary storage space for further processing. Since Big Query also has a limitation for the maximum response size that can be processed.
You can do that either if you are using Legacy or Standard SQL, and you can follow the steps to do it in the documentation pretty easily.
By doing this, you effectively reduce the load of per query data scanning and may help to solve your memory limit issue.


More reading resources:

Some other resources on the internet that answers this question too. Can take a look at all of them to get different perspectives on this.

Data @Carousell "Meeting minutes"

Hi all, this article is a meeting's minutes during a visit to Carousell HQ, to meet and talk with the Data Engineering lead there! Some context; This was a meeting on how to do data engineering and analytics for a company in its early stages of data analytics. Hope you enjoy the read and be able to hopefully learn something new too!

Metrics is important, but only the important metrics are important. Focus on what matters rather than trying to come up with more metrics.
Note that this whole note is written in chronological order, meaning that the flow of the conversation is recorded from top to bottom. Scroll all the way to the bottom for the key points and summary :)

Data as a practice within the company.

Questions to think about when you build out data needs and thinking within your company:

  • What data foundation do you have right now?
  • How do you actually use data to help understand your users?
  • What do you focus on to drive the company?
  • Don't just look at things like revenue and more, look at metrics to improve the user experience and quality of your product.

The AAARRR framework

The AAARRR framework should be used to build out your data needs for the business. Variations of these are also useful, using the AAARRR framework is not a hard and fast rule.

  • Awareness
  • Acquisition
  • Activation
  • Retention
  • Revenue
  • Referral

For example, at Carousell, the "users" are always the buyer and seller. They are the major stakeholders for their app, because Users are not very valuable until they buy and sell.
Thus for Carousell, buyers and sellers are what the key metrics focuses on.

Frame all of AAARRR metrics together, so you can look at all of them together as a big picture. Do not over rely on just 1 or 2 metric from the AAARRR framework, it is only effective when viewed together.
Example is that, when you look at something like paid vs organic acquisition, you should map it to the other items in the AAARRR framework, and don't just see it as a acquisition thing.

Type of metrics to have

Focus on creating metrics that help to elevate user experience and drive the product.
Create metrics that help you understand your user base and their pain points.
Create metrics that help you identify and work with Magic moments rather than just have numbers to see.

Product market fit:

When does your product really fit? Rather than obsessing over getting to a specific number, make sure that you are driving improvements as measured by the AAARRR framework.
Driving awareness with referral source, so that you can easily attach a attribute to a specific user cohort/group.
When thinking about data analytics, it is important to look at the metric and Optimize for what is the best for users, rather than trying to hit a specific number in your metrics.

Magic moments:

For example in carousell, the magic moment for user retention is having 3 chats and 1 sale in the first month
You need to find the magic moment for user retention, then after finding it, you need to focus on driving that magic moment.
A/B testing to see if your magic moment is causality thing or if it is just a random chance.
For A/B testing, do remember to space them appropriately. Not too long and not too short.

Question --> How magic is magic?

Magic moment for carousell is when user makes a sale or purchase, because at that moment they become of value to carousell.
Magic moment, is not something that is fixed, it is something that should change and evolve following your app and your different types of users.
Meaning that for example, there can be different magic moment for different age groups or feature from A/B testing for your app.

Features and their target group

See if you are solving a strong problem for users.
Make sure the feature you push, is for as wide a audience amongst your current user base as possible.
You need to subgroup your analysis between different type of users and then compare it to your general user base.
Meaning that let say, you have a feature that is only meant for 18 or younger, you need to create the metric where the user base is 18 and below, and not the whole user base.
After you have the metric for that specific group, compare it with other features that are meant for other user groups, so that your features metrics can be put into context and compared correctly.

Use AAARRR to segment users based on that acquisition source
So meaning that paid acquisition vs organic acquisition will most likely have different magic moments, so you need to figure the best signals to create the best metrics.
Don't just have a pile of dashboard that nobody can use or understand just because it is shiny and cool.

When you are optimizing/improving the features for a certain group of users, make sure the feature does not worsen the experience of the other users in the user base.

User focus groups

Focus on the user group that have the most opportunities in the company.
Only do analysis after you have this focused group, so that when you do metrics, you know exactly what and who every feature is for.
Business users should focus on identifying the most important user group, before the product team can create/improve features for these users and finally have the data team to create metrics focused on those user groups, rather than across all data.
Although it is good to split your users for more detailed analysis, if you don't really think that there is a lot of different cohorts of users, than you should not split your user group too much as it will just cause you to create too many metrics that are hard to be viewed/compared in context to each other.
Remember that when you are grouping/segmenting your user bases, 1 user can be classified as more than "1" user. Because user may use the app for 2 or more different purposes, so that can fit under more than 1 group. Group your users into their different use cases and usage patterns rather than just simple head count.
You can think of it as users who fall under 2 circles in the same Venn diagram.

Feature cycles

Sometimes, changes are not reflected very fast on your metrics, so like let say something take 1 month before it shows the result, than in your data needs to be tuned to take into account of that.
Don't have success metrics that are clickbait like. Don't focus on things like, "users who click this button is most likely to retain". This is because you will just screw up your user pattern and data gathered after that.
Make sure that your metrics and magic moments are focused on customer success and experience.
In that case, then you can see the real improvements for the app and more.

Cause and effect not as important as improving customer experience to drive magic moment.
Optimize for magic moments, then let the experiment ride out for sometime.

Question --> How to identify issues that are not there?

Use UX touchpoints.
You can also use things like "Conversion from X into Y screen", to tell u if the journey is successful.
However the above method of looking at conversion is very low level, meaning that it should be fixed asap as you do not want to dwell in that context for too long as it is not important to the big picture.

Data Infrastructure at Carousell.

On the foundation of a good Data infrastructure, you should think about 2 critical elements first:

  • How is your data tracked?
    • What kind of strategy do you have for this?
    • Will this be changing in the future either in the short or long run?
      • What are the effects of it changing in the future? Does your whole data infrastructure needs to be overhauled?
  • How are you organizing your data?
    • How do you organize your data and split them up?
    • By tags / attributes? How are you coming up with these tags and attributes?

Roles of data engineers in a early stage company

Data engineers from early stage companies should have 2 main focus area:

  • Linking actions to data
    • Link actions to data and link data to actions means that you need to understand what your data means to your product, your company, and your users.
  • Tracking and organizing data
    • Tracking and organizing data, in simple terms means to know where and how your data is collected and how they are stored.
    • It is important to get this step right, as most work is built on this.

Other tasks and work of data engineers in early stage companies:

  • Building out the infrastructure before working on analysis tools.
  • Do audit checks
    • So that you can get a better understanding on what you already have and what you need to gather.
    • To plan for what are the data to be gathered next and what are the unnecessary items. (Essentially plan for the next step to do to make your metrics/dash more complete)
  • Create automations so that you are not doing things manually every single time.
  • A huge ongoing tasks for data engineers is tracking all the data and trying to automate as much of the tasks at hand as possible.

Whenever there is data, there needs to be a clear idea of how your product creates data.
Make sure you understand where your data comes from and how it is generated.
You want to make sure that all your data have enough meta data to have a good idea of what that piece of data means.

The data engineering team should focus on thinking about the data and the infrastructure with above 2 focal points rather than the product. Because their job is strictly about understanding the data and not the product.

Data engineers need to fundamentally understand the long term data structure and plan.
Because in the long term, there is a lot of things that you would need to change because most of the tools you use today, is just not as good or suited for the job as your data grows.

Events fired can have ambiguous meanings, thus it is important to understand the data, rather than just try to build dashboards. Think about, "is the data that I am collecting really what is needed for the analysis?"
Role of the data engineer is to make sure all the data needed is easily accessible for use, regardless if the data is needed for visualization or for the business users.
Find the gaps for user and product data, come up with a plan to gather those data and execute it.

Advice for early stage companies

Figure out when the time comes when you need a full data team.
You would need a full time data engineer/team when you need to do things like unify data from marketing and product.
This is because features like data auto tagging from systems like firebase can only help you so much until you need someone to do it.
At some point, you will probably hire an analyst too. An analyst is the person who gets the data out into a good and usable form to make a business point and to help make use of the data for a business decision.

Tools used by Carousell

Carousell uses alot of google tools, from things like data ingestion/collection with google pub/sub, Big Query for their data ETL process and to using looker for visualization.
I quote, "Figuring out the right tool is important, and making sure you made a good choice of tools before moving on. However at the same time, when you are just starting out you really dont need to over invest, the BI tool is not the most important thing."
The most important thing is to figure out what is your key metrics and use cases, and just get those up however is the most convenient, because 90% of dashboard goes unread and for usually for your most important metrics/dashboards they are not very complex and do not need alot of effort to setup. The key is thus to really understand what is really needed and focus on that instead.

Data processing

  • Having a preprocessing method is very important.
  • Choose to preprocess data rather than doing everything in the last mile with your custom SQL statements.
  • Querying directly from source is bad, and we should focus on things like incremental processing layers that are automated. This is because after a certain point in time, even your DB would not be able to handle all your complex queries as the data grows too big. Thus it helps to do incremental processing as soon as possible so as to avoid incurring too much technical debt. "Don't incur too much tech debt because of the ease of extracting directly from source."
  • Word of caution, before you design your processing layer and infrastructure out, make sure you fully understand the needs of your data and not just blindly use everything.

Analysis

In relation to the AAARRR framework

"Correlation between acquisition with retention". Map retention numbers with their acquisition source.
Really understand your cohort / groups of user, by sub grouping them by things like data and app version numbers.

Magic moment

"Analyze depth of activation and magic moment rather than quantity", meaning that you should focus more on how deep a user is in the activation cycle rather than trying to count the number of users who is just activated on the surface and using the count as a gauge.
However how deep into the magic moment you analyze really depends on how you identify your magic moment and engagement.
You should be analyzing it and trying to apply it from the user's perspective


Key takeaways:

  • Focus on identifying magic moments during user engagement for customer retention
  • Using the focused magic moments identified by business users to build out a data infrastructure that extracts and cleans the data to build metrics for these magic moments.
  • Make sure your metrics are focused on the user. What the user do, is what should drive your company.
  • Basically magic moment is important, figure out the magic moment for the different type of users as you do not want to generalize magic moment too much across the different users.
  • So after you do that you should focus on building data infrastructure and metrics that are specifically for understanding the magic moments.
  • Automate as much processes as possible

In essence, how you use data is really dependent on your product. Don't worry too much, just focus on improvements for your customer using the AAARRR framework!