
Named exports are better than default ones (mostly)

This article covers the difference between the two module export types in ES6 modules, default vs named exports, and why named exports are the preferred option in most cases.

tl;dr

Prefer named exports over default exports as named imports are explicit and have better type/editor support.

What is a default export?

ES6 module default exports come from replicating the functionality of exporting modules in the CommonJS spec. Let's take a look at the following piece of code:

// src/components/SomeComponent.js
export default class SomeComponent {
    ...
}

You can then import it in another file with:

// src/App.js
import SomeComponent from './components/SomeComponent';

After importing it you can use it however you like in App.js. The important thing is that you don't have to use SomeComponent as the name of the imported code. It can be literally anything, because there's no strict convention here.

What is a named export?

A named export is where you export a symbol like a variable or function name:

// src/components/SomeComponent.js
export class SomeComponent {
    ...
}

You can then import it in another file with:

// src/App.js
import { SomeComponent } from './components/SomeComponent';

To import the same component now that it is defined as a named export, you need to use the exact same name as defined in the source file.

The case for named exports

Named exports should be preferred over default exports in the vast majority of cases, with some exceptions. The main reason is the naming requirement during import: named exports enforce the use of the same name in the importing file. This section explores the implications of this difference to make the case for using named exports by default.

Importing named exports is less buggy

Named exports are explicit and stricter, because the names are defined in the exporting file and naming consistency is enforced.

Consider the following code: it works perfectly but is wrong on so many levels:

import Angular from "react";

function HelloComponent() {
  return Angular.createElement("div", null, "Hello World");
}

This can happen because it's difficult to figure out what is being imported when a module only has a default export. If you are using an unfamiliar module or file, it can be hard to tell what is returned by something like const list = require("./list");. In this context, what would you expect list to be? It's unlikely to be a primitive value, but it could plausibly be a function, a class, or some other kind of object. How will you know for sure? You need to make a side trip, either by opening the file and looking for the export, or by opening the documentation. Either way, this becomes an extra bit of information you need to keep in your head to avoid a second side trip the next time you import from list.js. If you import a lot of defaults, either your cognitive overhead increases or the number of side trips increases. Both are suboptimal and frustrating.
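
To make the side-trip problem concrete, here is a minimal sketch (list.js and the LinkedList name are placeholders used only for illustration):

// If list.js uses a default export, the local name is whatever you choose,
// and it tells you nothing about what the module contains
import list from "./list.js";

// If list.js uses a named export, the name is checked against the module,
// and your editor can autocomplete it and jump straight to the definition
import { LinkedList } from "./list.js";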

Names should be consistent throughout all files. If something is an Apple in one file, it shouldn't be called Orange in another file. An Apple should always be an Apple.

By using named exports, you can state precisely what you are importing without worrying about importing the wrong thing, as most editors will warn you about a bad name and let you jump to its definition directly:

import { createElement } from "react";

function HelloComponent() {
  return createElement("div", null, "Hello World");
}

If you would like to use a different name for whatever reason (e.g. to avoid duplicate names in the same namespace), you can still do so explicitly:

import { createElement as h } from "react";

function HelloComponent() {
  return h("div", null, "Hello World");
}

You can also group imports with a new namespace to avoid naming collisions:

// src/components/UI.js
export function Button() { return <button>Something</button>; }

// src/App.js
import * as Buttons from './components/UI';

// A local component with the same name as the Button exported from UI.js
const Button = () => <button>Something</button>;

function App () {
    return (
        <>
            <Button />
            <Buttons.Button />
        </>
    )
}

In summary, these imports are explicit and have better type/editor support.

It is confusing when the same package uses both default and named exports

Ever paused wondering if you have to do something like this:

import { React, createRef } from 'react';

Or this:

import React, { createRef } from 'react';

That pause is bad developer experience and can break your flow; worse still, you might shrug it off and import the wrong thing.

Cleaner export aggregation

Named exports are helpful if you use barrel files, i.e. combining exports in an index.js file within a folder to make import paths simpler.

With named exports

// index.js barrel file
export { Title } from "./components/Title";
export { Navigation } from "./components/Navigation";

With default exports, it typically becomes a two-step process:

// index.js barrel file
import Title from "./components/Title";
import Navigation from "./components/Navigation";
export { Title, Navigation };

Name matching problems

Named exports require consuming modules to at least specify the name of what they are importing from a module. That makes it easy to search for a symbol like LinkedList, and once found, it refers to the same LinkedList everywhere. Default exports do not enforce their imported names, which increases your cognitive load: you need to determine the correct naming convention and make sure every other developer working on the same application uses the same name for the same thing. (You can, of course, allow each developer to use different names for the same thing, but that introduces more cognitive overhead for the team.)

Importing a named export means at least referencing the canonical name of a thing everywhere it's used. Even if you choose to rename an import, the rename is done explicitly at the point where the canonical name is first referenced:

import { LinkedList as MyList } from "./list.js";

When naming is consistent across a codebase, you're able to easily do things like:

  1. Search the codebase to find usage information.
  2. Refactor the name of something across the entire codebase.

Doing the same with default exports and ad-hoc import naming is possible, but a lot more difficult, complicated, and error-prone.

Importing the wrong thing

Named exports in JavaScript modules have a particular advantage over default exports in that an error is thrown when attempting to import something that doesn't exist in the module. Consider this code:

import { LinkedList } from "./list.js";

If LinkedList doesn't exist in list.js, then an error is thrown. Furthermore, tools like IDEs and ESLint can easily detect missing references before any code is executed.
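
With a default export, by contrast, there is no name to check: the import succeeds with whatever local name you pick. A minimal sketch, assuming list.js contains only export default class LinkedList {}:

// No error here, even though the local name is misleading and says
// nothing about what list.js actually exports
import Queue from "./list.js";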

Exceptions: The case for default exports

There are times where using default exports is simply easier. Just because named exports should be preferred most of the time does not mean it is a hard and fast rule where default exports are 'banned'.

VueJS components

When using Vue SFCs (Single File Components), each component is compiled into a single default export, and this cannot be changed.

See this github issue and this specific comment by Evan You on why Vue components will always be default exports.
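
For reference, here is a minimal sketch of a classic (Options API) SFC; the file and component name are just placeholders. The <script> block is where the default export lives, and it is what the SFC compiler turns into the component:

<!-- src/components/SomeComponent.vue -->
<template>
  <button>Something</button>
</template>

<script>
export default {
  name: "SomeComponent",
};
</script>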

Lazily loaded modules

When lazily importing modules with the async import() function, the code is always a lot cleaner to write with default exports, and this comes up quite often when code splitting / lazy loading different view components.

There is also the case of lazy loading view components in vue-router, where it is much easier to hand over a default export than to pick out the named export, as the sketch below shows.
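
A rough sketch of the difference in a vue-router route table (the route path and file names are placeholders):

// Default export: the dynamic import can be handed to the route as-is
const routes = [
  { path: "/about", component: () => import("./views/AboutView.vue") },
];

// Named export: you would have to unwrap the module yourself
const routesWithNamedExport = [
  {
    path: "/about",
    component: () => import("./views/AboutView.js").then((m) => m.AboutView),
  },
];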

Other tools, such as those used when developing React apps, also do not accommodate the named export convention, for instance when lazy loading React components. It's a design decision made by the React team. In the end you have to write something like the following to lazy load a React component:

// With named export: map the module onto the { default: ... } shape that React.lazy expects
const LazyComponent = React.lazy(() => import("./LazyComponent").then((module) => ({ default: module.LazyComponent })));

// With default export
const LazyComponent = React.lazy(() => import("./LazyComponent"));

function App() {
    return (
        <div>
            <Suspense fallback={<div>Loading ...</div>}>
                <LazyComponent />
            </Suspense>
        </div>
    )
}

Conclusion

Since it is generally better to be explicit than implicit, what something does, what something should be called, etc. should be made explicit whenever possible. Therefore, use named exports wherever you can, and fall back to default exports only in the specific edge cases where a named export would be more cumbersome.

More specific advice:

  • For Vue UI components defined as SFCs (one file per component), use default exports, as this is set by Vue and cannot be changed.
  • When writing TS/JS modules, always use named exports.


Resources exceeded during query execution

“Resources exceeded during query execution”, an article on the memory limit problem faced in BQ.

In some cases you will get a “Resources exceeded during query execution” error when trying to build a large derived table in BQ (BigQuery). This generally happens for queries that require a huge amount of memory to run, or when an inner/derived query needs a huge amount of memory to temporarily hold results while more operations are performed on them.

Why does this happen?

This happens because, even though BQ is a distributed relational database and most queries are executed across multiple nodes, certain operations require all of the data to be processed on a single node, which means all the data needed for the calculation must fit on that node. And since BQ runs with a "serverless paradigm", it places several limits on single-node computation.
When your query requires too much data to be processed on a single node and the data no longer fits on that node, you get the “Resources exceeded during query execution” error and the whole query fails.
In the case of failure, even if your query already processed a huge amount of data or took a really long time to run, your account will not be billed by Google for this query, since failed queries are not billed (refer to BQ billing).


What operations operate on a single node?

Operations that need to see all the data in the resulting table at once have to operate on a single node.
Un-partitioned window functions like RANK() OVER() or ROW_NUMBER() OVER() will operate on a single node.
Another operation that is often mentioned as problematic is the ORDER BY clause. It requires a single node to see all the data, so it may hit the memory limit of a single node.
ORDER BY clause:

  • If your query contains an ORDER BY clause, all the data is still going to be passed to a single node and then sorted.
  • If it doesn’t fit on the node, the query fails. Since BQ is a scanning database (it scans the whole column for each query), sorting the data is usually unnecessary anyway.
  • So don’t use ORDER BY to sort derived table results (see the sketch below for the kind of query that triggers this).
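
A sketch of the kind of statement involved (the table and column names are hypothetical):

SELECT *
FROM my_dataset.big_events      -- hypothetical large table
ORDER BY event_timestamp;       -- every row must be gathered and sorted on one node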

When will it matter?

When you want to generate a primary key for a derived table. A common way to do this is to use ROW_NUMBER() OVER() AS id to generate a sequence number for each row in your table. If the table is big, the query fails with “Resources exceeded” because this function tries to operate on a single node.
Another situation is ordering a table, which requires all the data to be present to do a proper sort.
In fact, any task that is not based on grouping data, or that cannot easily be split up to run in parallel, has a chance of hitting this type of issue.


How to avoid operating on a single node and prevent the query from failing?

Unfortunately, there seems to be no way to increase the per-node memory limit of BQ, and it is unlikely that Google will change this soon. At the time of writing, there also seems to be no known way of customizing the node BQ uses for a calculation. It is therefore better to work with this limitation in mind and design solutions that do not rely heavily on operations that overload a single node. Below are some ways to deal with the issue:

Partitioning using a generated key

Luckily, there’s an easy way around this. You can PARTITION the window function by date and build a string as the primary key. For example:

CONCAT(CAST(ROW_NUMBER() OVER(PARTITION BY event_date) AS STRING),
'|', CAST(event_date AS STRING)) AS id

Splitting up the query and using "temporary tables"

You can try to split your query up.
Write the results of each individual sub/inner query to another table as temporary storage for further processing (BigQuery also has a limit on the maximum response size it can return, so writing intermediate results out helps there too).
You can do this with either Legacy or Standard SQL, and the steps in the documentation are easy to follow.
By doing this, you effectively reduce the amount of data each query has to scan and hold, which may help solve the memory limit issue.
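
A rough sketch of the idea in standard SQL (the dataset, table, and column names are all hypothetical):

-- Step 1: materialize the heavy inner query into its own table
CREATE OR REPLACE TABLE my_dataset.tmp_daily_events AS
SELECT
  event_date,
  user_id,
  COUNT(*) AS event_count
FROM my_dataset.raw_events
GROUP BY event_date, user_id;

-- Step 2: run the rest of the query against the smaller intermediate table
SELECT
  event_date,
  COUNT(DISTINCT user_id) AS active_users
FROM my_dataset.tmp_daily_events
GROUP BY event_date;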



Data @Carousell "Meeting minutes"

Hi all, this article is the minutes of a meeting during a visit to Carousell HQ, where we met and talked with the Data Engineering lead there! Some context: this was a meeting about how to do data engineering and analytics for a company in the early stages of data analytics. Hope you enjoy the read and hopefully learn something new too!

Metrics are important, but only the important metrics are important. Focus on what matters rather than trying to come up with more metrics.
Note that this whole note is written in chronological order, meaning the flow of the conversation is recorded from top to bottom. Scroll all the way to the bottom for the key points and summary :)

Data as a practice within the company.

Questions to think about when you build out data needs and thinking within your company:

  • What data foundation do you have right now?
  • How do you actually use data to help understand your users?
  • What do you focus on to drive the company?
  • Don't just look at things like revenue; look at metrics that improve the user experience and the quality of your product.

The AAARRR framework

The AAARRR framework should be used to build out your data needs for the business. Variations of it are also useful; using the AAARRR framework is not a hard and fast rule.

  • Awareness
  • Acquisition
  • Activation
  • Retention
  • Revenue
  • Referral

For example, at Carousell, the "users" are always the buyer and the seller. They are the major stakeholders for the app, because users are not very valuable until they buy and sell.
Thus for Carousell, buyers and sellers are what the key metrics focus on.

Frame all the AAARRR metrics together so you can look at them as one big picture. Do not over-rely on just one or two metrics from the AAARRR framework; it is only effective when viewed as a whole.
For example, when you look at something like paid vs organic acquisition, you should map it to the other items in the AAARRR framework, and not just see it as an acquisition thing.

Type of metrics to have

Focus on creating metrics that help elevate the user experience and drive the product.
Create metrics that help you understand your user base and their pain points.
Create metrics that help you identify and work with magic moments, rather than just having numbers to look at.

Product market fit:

When does your product really fit? Rather than obsessing over getting to a specific number, make sure you are driving improvements as measured by the AAARRR framework.
Drive awareness with referral sources, so that you can easily attach an attribute to a specific user cohort/group.
When thinking about data analytics, look at the metric and optimize for what is best for users, rather than trying to hit a specific number in your metrics.

Magic moments:

For example, at Carousell the magic moment for user retention is having 3 chats and 1 sale in the first month.
You need to find the magic moment for user retention, and once you have found it, focus on driving that magic moment.
A/B test to see if your magic moment is causal or just random chance.
For A/B testing, remember to space the tests appropriately: not too long and not too short.

Question --> How magic is magic?

The magic moment for Carousell is when a user makes a sale or purchase, because at that moment they become of value to Carousell.
The magic moment is not something fixed; it should change and evolve along with your app and your different types of users.
Meaning, for example, that there can be different magic moments for different age groups or features, as shown by A/B testing for your app.

Features and their target group

See if you are solving a strong problem for users.
Make sure the feature you push is for as wide an audience amongst your current user base as possible.
You need to subgroup your analysis between different types of users and then compare them to your general user base.
Meaning, say you have a feature that is only meant for users aged 18 or younger: you need to create the metric over the user base that is 18 and below, not the whole user base.
After you have the metric for that specific group, compare it with other features meant for other user groups, so that your feature metrics can be put into context and compared correctly.

Use AAARRR to segment users based on their acquisition source.
Paid acquisition vs organic acquisition will most likely have different magic moments, so you need to figure out the best signals to create the best metrics.
Don't just have a pile of dashboards that nobody can use or understand just because they are shiny and cool.

When you are optimizing/improving the features for a certain group of users, make sure the feature does not worsen the experience of the other users in the user base.

User focus groups

Focus on the user group that has the most opportunities for the company.
Only do analysis after you have this focused group, so that when you build metrics you know exactly what and who every feature is for.
Business users should focus on identifying the most important user group, then the product team can create/improve features for these users, and finally the data team can create metrics focused on those user groups rather than across all data.
Although it is good to split your users for more detailed analysis, if you don't really think there are many different cohorts of users, don't split your user groups too much, as you will just end up creating too many metrics that are hard to view and compare in context.
Remember that when you are grouping/segmenting your user base, one user can be classified as more than "one" user, because a user may use the app for two or more different purposes and so fit under more than one group. Group your users by their different use cases and usage patterns rather than by simple head count.
You can think of it as users who fall under two overlapping circles in the same Venn diagram.

Feature cycles

Sometimes changes are not reflected in your metrics very quickly. If something takes a month before it shows results, your data needs to be tuned to take that into account.
Don't have success metrics that are clickbait-like. Don't focus on things like "users who click this button are most likely to retain", because you will just skew your user patterns and the data gathered after that.
Make sure that your metrics and magic moments are focused on customer success and experience.
Then you can see the real improvements in the app and more.

Cause and effect is not as important as improving the customer experience to drive the magic moment.
Optimize for magic moments, then let the experiment ride out for some time.

Question --> How to identify issues that are not there?

Use UX touchpoints.
You can also use things like "conversion from screen X to screen Y" to tell you if the journey is successful.
However, looking at conversion this way is very low level, meaning it should be fixed ASAP; you do not want to dwell in that context for too long as it is not important to the big picture.

Data Infrastructure at Carousell.

For the foundation of a good data infrastructure, you should think about 2 critical elements first:

  • How is your data tracked?
    • What kind of strategy do you have for this?
    • Will this be changing in the future either in the short or long run?
      • What are the effects of it changing in the future? Does your whole data infrastructure need to be overhauled?
  • How are you organizing your data?
    • How do you organize your data and split them up?
    • By tags / attributes? How are you coming up with these tags and attributes?

Roles of data engineers in an early stage company

Data engineers at early stage companies should have 2 main focus areas:

  • Linking actions to data
    • Linking actions to data and data to actions means that you need to understand what your data means to your product, your company, and your users.
  • Tracking and organizing data
    • Tracking and organizing data, in simple terms, means knowing where and how your data is collected and how it is stored.
    • It is important to get this step right, as most work is built on this.

Other tasks and work of data engineers in early stage companies:

  • Building out the infrastructure before working on analysis tools.
  • Do audit checks
    • So that you can get a better understanding of what you already have and what you still need to gather.
    • To plan which data to gather next and which items are unnecessary. (Essentially, plan the next steps to make your metrics/dashboards more complete.)
  • Create automations so that you are not doing things manually every single time.
  • A huge ongoing task for data engineers is tracking all the data and trying to automate as much of the work at hand as possible.

Whenever there is data, there needs to be a clear idea of how your product creates that data.
Make sure you understand where your data comes from and how it is generated.
You want to make sure that all your data has enough metadata to give a good idea of what each piece of data means.

The data engineering team should focus on thinking about the data and the infrastructure with the above two focal points in mind, rather than on the product, because their job is strictly about understanding the data and not the product.

Data engineers need to fundamentally understand the long-term data structure and plan.
This is because, in the long term, there is a lot you will need to change: most of the tools you use today are just not as good or as well suited for the job once your data grows.

Events fired can have ambiguous meanings, so it is important to understand the data rather than just try to build dashboards. Think about: "is the data that I am collecting really what is needed for the analysis?"
The role of the data engineer is to make sure all the data needed is easily accessible for use, regardless of whether the data is needed for visualization or for the business users.
Find the gaps in user and product data, come up with a plan to gather that data, and execute it.

Advice for early stage companies

Figure out when the time comes that you need a full data team.
You will need a full-time data engineer/team when you need to do things like unify data from marketing and product.
This is because features like automatic data tagging from systems like Firebase can only help you so much before you need someone dedicated to it.
At some point, you will probably hire an analyst too. An analyst is the person who gets the data into a good, usable form to make a business point and to help turn the data into business decisions.

Tools used by Carousell

Carousell uses a lot of Google tools, from data ingestion/collection with Google Pub/Sub, to BigQuery for their data ETL process, to Looker for visualization.
I quote: "Figuring out the right tool is important, and making sure you made a good choice of tools before moving on. However, at the same time, when you are just starting out you really don't need to over-invest; the BI tool is not the most important thing."
The most important thing is to figure out your key metrics and use cases, and get those up in whatever way is most convenient, because 90% of dashboards go unread, and your most important metrics/dashboards are usually not very complex and do not need a lot of effort to set up. The key is to really understand what is needed and focus on that instead.

Data processing

  • Having a preprocessing method is very important.
  • Choose to preprocess data rather than doing everything in the last mile with your custom SQL statements.
  • Querying directly from source is bad; focus instead on things like automated incremental processing layers. After a certain point in time, even your DB will not be able to handle all your complex queries as the data grows too big. It helps to start incremental processing as soon as possible to avoid incurring too much technical debt. "Don't incur too much tech debt because of the ease of extracting directly from source."
  • A word of caution: before you design your processing layer and infrastructure, make sure you fully understand the needs of your data, and don't just blindly use everything.

Analysis

In relation to the AAARRR framework

"Correlation between acquisition with retention". Map retention numbers with their acquisition source.
Really understand your cohort / groups of user, by sub grouping them by things like data and app version numbers.

Magic moment

"Analyze depth of activation and magic moment rather than quantity", meaning that you should focus more on how deep a user is in the activation cycle rather than trying to count the number of users who is just activated on the surface and using the count as a gauge.
However how deep into the magic moment you analyze really depends on how you identify your magic moment and engagement.
You should be analyzing it and trying to apply it from the user's perspective


Key takeaways:

  • Focus on identifying magic moments during user engagement for customer retention.
  • Use the magic moments identified by business users to build out a data infrastructure that extracts and cleans the data needed to build metrics for these magic moments.
  • Make sure your metrics are focused on the user. What the user does is what should drive your company.
  • Basically, magic moments are important; figure out the magic moment for the different types of users, as you do not want to generalize the magic moment too much across different users.
  • After that, focus on building data infrastructure and metrics that are specifically for understanding those magic moments.
  • Automate as many processes as possible.

In essence, how you use data is really dependent on your product. Don't worry too much, just focus on improvements for your customer using the AAARRR framework!