# Uninstall tracking via BigQuery and Airflow

At [InCred](https://www.incred.com), our [mobile app](https://play.google.com/store/apps/details?id=com.incred.customer) is an integral part of our lending process. It helps customers complete their loan application, gives them information about their loan, and payment history, allows them to make payments, and stays in touch with InCred. That's why it is crucial for our business that our customers keep the app installed on their devices. However, sometimes customers may remove the app inadvertently or otherwise. In that situation, they may miss payment reminders and other vital communication related to their loan potentially causing them financial harm. So we wanted to reach out to customers in case of an uninstall quickly. In this article, we will talk about how we are using [BigQuery](https://cloud.google.com/bigquery/) to track user uninstalls in conjunction with [Airflow](https://airflow.apache.org/) to fire remedial action notifications to our users.

### Hunt for an API

We would like to build a REST API on top of BigQuery which can tell the app status for a given user on the given month range. At InCred, language is not the barrier where a microservice can be written in Java/GoLang/NodeJs. For this case, we went with NodeJs and used [BigQuery NodeJs client](https://github.com/googleapis/nodejs-bigquery) along the way. The API would have the following request body:

```json
{
  "userIds": [
  <list of user identifiers>
  ],
  "startMonth": 1,
  "startYear": 2020,
  "endMonth": 5,
  "endYear": 2020
}
```

For every user, this API should return the status of our app with values ranging from *Uninstalled/Installed/Never Installed*.

Our Android app fires many events to [Firebase Analytics](https://firebase.google.com/docs/analytics/). These events help in analyzing user behaviour. To gain deeper insights, we have BigQuery integrated into [Firebase analytics](https://firebase.google.com/docs/analytics/). To track uninstalls, we started looking out whether we can explicitly fire any event when the app is getting uninstalled by bundling it with user information. Unfortunately, this was not available and then we came across Firebase [auto-logged events](https://support.google.com/firebase/answer/6317485?hl=en), where one of the events `app_remove` was specifically meant for that purpose. To define the uniqueness of a user, we send `[user_id](https://firebase.google.com/docs/analytics/userid)`when the users authenticate via phone. Right, so just querying with `app_remove` event name will not solve the problem. That wasn’t meant to be!!

#### What if the user had reinstalled the app again?

If we just went with listening to the`app_remove` event, the user having reinstalled the app, will be shown as *Uninstalled.* Not the desired result. We also wanted to check if we could find anything unique about the user before logging into our app and then we saw `user_pseudo_id`. Firebase automatically generates this unique ID within BigQuery for each user. So, we created a map of `user_id` and `user_pseudo_id` and then analyzed the uninstalls. This failed too as it started giving false-positive results.

#### Keep-It-Simple-Silly(KISS)

When the problem becomes complex, it is always good to break it down into sub-problems. Then, solve the sub-problems and connect them all together. For a user in the given month range, we broke this problem into 3 parts:

* We would get the maximum `event_timestamp` for the `app_remove` event. This would also club the fact that the user has installed/uninstalled the app multiple times. Since BigQuery shards the table based on date, we used [wildcard(\*)](https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference) to cover all the tables for a particular month.
    
    ```sql
    select max(event_timestamp) as app_remove_timestamp, user_id as user_id from " + <table_name.yyyymm*> + " where event_name = 'app_remove' and user_id is not null
    ```
    
* We would get the maximum engagement timestamp of a user. This would get the last timestamp when the user engaged with our app.
    
    ```sql
    select max(event_timestamp) as max_engage_timestamp, user_id as user_id from " + <table_name.yyyymm*> + " where user_id is not null
    ```
    
* Both the results would also be run on the same day since the BigQuery intraday table name is a little different. We would combine the last results to give us the bucket for *Installed/Uninstalled/Never Installed* users. Let’s see how:
    
    * If the `app_remove_timestamp` is greater than `max_engage_timestamp`, then the user has ***Uninstalled*** the app.
        
    * If the `app_remove_timestamp` is null or less than `max_engage_timestamp`, then the user has the app ***Installed***
        
    * If both the timestamps are null, then the user has ***Never Installed*** the app in that range.
        

> Note: All this works for a given month range. We know for sure that when the user has taken a loan from us and then gives an estimated month range on this API to give the desired result.

#### Why not include a date too to make this API complete?

With month ranges, we were not able to track the uninstalls every week or in the given time frame, say, we wanted to track from 5th Jan 2020 to 3rd Mar 2020 when we ran a campaign or disbursed loans via some offer. To cater to that, we changed the request body a little to include optional `startDay` and `endDay` and then tweaked the earlier solution. With the help of [`TABLE_SUFFIX`](https://cloud.google.com/bigquery/docs/querying-wildcard-tables#scanning_a_range_of_tables_using_table_suffix), we were able to scan the table for the given date range.

For the range of 5th Jan 2020 to 3rd Mar 2020, we used [`TABLE_SUFFIX`](https://cloud.google.com/bigquery/docs/querying-wildcard-tables#scanning_a_range_of_tables_using_table_suffix) to get the data set from 5th Jan to 30th Jan, then ran a [Wildcard(\*)](https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference) for Feb month, and lastly ran a range query with [`TABLE_SUFFIX`](https://cloud.google.com/bigquery/docs/querying-wildcard-tables#scanning_a_range_of_tables_using_table_suffix) to query from 1st Mar to 3rd Mar.

With this, we successfully developed an API over BigQuery that can track user uninstalls on the given date range.

### Periodic notifications via Airflow

Once we developed an API, we wanted to utilize it the most by pushing automatic reminders to our users to have the app installed. At InCred, we use [Apache Airflow](https://airflow.apache.org/) to program, schedule, and monitor our workflows. Airflow provides a very neat dashboard to monitor your jobs, logging, retry policy, and much more via [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) runs.

**Let’s break down the problem**. We would get the users with active loans(SQL query to our database), run those users on our Uninstall tracking API, and then send notifications via SMS to all those who have uninstalled our Android app.

Since the users with active loans would be huge, running every logic on a single DAG would take a lot of time to complete. Also, if there are a huge number of users then our SQL query underneath BigQuery would significantly increase thus hitting the [Maximum unresolved standard SQL query length of 1 MB](https://cloud.google.com/bigquery/quotas). So, we broke this problem even further and created multiple child DAGs.

* We set our master DAG to get the users having active loans.
    

![Master DAG](https://cdn.hashnode.com/res/hashnode/image/upload/v1697537556723/6104cffc-c546-4d39-a8a0-7d8001982a7c.png align="left")

* We configured the batch size on Airflow to create a child DAG for every batch.
    

![Child DAG](https://cdn.hashnode.com/res/hashnode/image/upload/v1697537558910/30b731c8-8160-4d1c-a5f9-dbc21dc0e3e9.png align="left")

* These child DAGs would run in parallel to get install status and send notifications.
    
* Child DAGs would have their retry policy along with master DAG.
    

Once we brainstormed on the solution, with Airflow, creating this setup was a piece of cake. We did just that and executed our workflow to work like a charm. We then went ahead and scheduled it bi-weekly for periodic reminders to our users.
