What is this?
It was early January 2024, and I was trying to decide whether, and when, to buy a piece of land. I plan to build a house for my family and me in the near future.
At that point, I had only one yearly report that mentioned the market was at its lowest point. This source did not provide a dataset; it only contained references to non-disclosed privately gathered data and some Excel charts. However, it was respectable enough to trust. Additionally, news articles suggested that this was the best moment to buy.
There is a fairly good (in my Grug eyes) rule of thumb that asserts that large public mentions of a market event signal the beginning of the next typical cyclical phase. In other words, popular opinions tend to lag behind true phenomena, such as:
My professional bias led me to build a sensor, a thermometer to validate public opinion and confirm my rule of thumb, so I could time my purchase correctly. I started thinking about setting up a couple of scrapers on some popular and high-traffic real estate websites to monitor trends.
- real-state-company-webapp.com stack: PHP + Angular.
- ?items_per_page=1000 worked --> my Grug monkey brain was happy.
- real-state-company.com/api/: the server state was dumped in the response and contained references to this unused API, at least not used by the browser.
- real-state-company.com led me to a CRM website under construction.
- debug=True; they were developing in production --> I had complete access to most of their infrastructure resources.
After discovering the environment dump, I reached out to them:
Talking to them and using their data freely:
Having access to both listing and final selling prices was a gold mine. I began analyzing the distribution of the ratio between them, regardless of the absolute values. I wanted to gauge the negotiation margins available in my price range.
As you can see, the data was quite noisy. After all, it was entered by humans who either made mistakes when updating listing prices or purposefully misrepresented the values. Why lie? Well, in my country, lying is a common practice in real estate to enable tax evasion and facilitate under-the-table transactions. For example, a house listed at 100 ends up "officially" selling at 50, but in reality, the buyer pays 50 in white money and 25 under the table in cash. This way, both parties benefit: the seller receives more money, and the buyer pays less tax. My Grug brain started seeing patterns in the noise.
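To make the distortion concrete, here is the arithmetic from that example as a tiny Python sketch. The variable names are my own, and the numbers are the illustrative ones from the paragraph above, not real data.

```python
# Illustrative numbers from the example above (not real data).
listing_price = 100
declared_price = 50    # "official" selling price on paper
cash_on_the_side = 25  # paid under the table

# Ratio a dataset would record versus what the buyer actually paid.
recorded_ratio = declared_price / listing_price
actual_ratio = (declared_price + cash_on_the_side) / listing_price

print(f"recorded: {recorded_ratio:.2f}, actual: {actual_ratio:.2f}")
```

A record like this drags the recorded ratio well below what the negotiation really was, which is one reason the low end of the distribution cannot be taken at face value.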
To perform some qualitative analysis, I removed samples with a ratio greater than 1.
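A minimal sketch of that filtering step, assuming each record carries a listing price and a final selling price. The `Sale` class, the `keep_plausible` helper, and the sample values are hypothetical and only illustrate the idea; the guard against non-positive listing prices is my own addition, not something the data required.

```python
from dataclasses import dataclass


@dataclass
class Sale:
    listing_price: float
    selling_price: float

    @property
    def ratio(self) -> float:
        # Selling price as a fraction of the listing price.
        return self.selling_price / self.listing_price


def keep_plausible(sales):
    """Drop records with a non-positive listing price or a ratio above 1."""
    return [s for s in sales if s.listing_price > 0 and s.ratio <= 1]


# Made-up sample records, for illustration only.
sales = [
    Sale(100.0, 90.0),   # sold below listing: kept
    Sale(80.0, 95.0),    # sold above listing: removed
    Sale(120.0, 120.0),  # sold at listing: kept (ratio is exactly 1)
    Sale(0.0, 50.0),     # broken record: removed by the guard
]
kept = keep_plausible(sales)
print([round(s.ratio, 3) for s in kept])
```

Sales above the listing price can be genuine, so this cut is a convenience for eyeballing the discount side of the distribution rather than a claim that those records are wrong.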
This led me to our first image. What are those frontiers? They are subsets of our domain highly populated by real samples—actual sales that occurred. After visually inspecting a few examples, I noticed that:
0.8534. Therefore, it was evident that there was no high concentration of properties at, let's say, a 10% discount, but rather a continuum depending on price.
To informally test this hypothesis, I created an artificial property valued at 91.3K. I then scatter-plotted the discount ratios for selling values with discounts that are multiples of 5, such as:
monkey brain likes 5*N
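The experiment can be sketched in a few lines of Python. This assumes the discounts are absolute amounts in thousands (5K, 10K, 15K, ...) taken off the 91.3K listing price; that reading, along with the `discount_ratios` helper and its `step` parameter, is my own illustration and not the exact code behind the plot.

```python
LISTING_PRICE_K = 91.3  # the artificial property from the text, in thousands


def discount_ratios(listing_price, step=5.0):
    """Return (discount, selling/listing ratio) pairs for discounts of step, 2*step, ..."""
    pairs = []
    n = 1
    while n * step < listing_price:
        discount = n * step
        pairs.append((discount, (listing_price - discount) / listing_price))
        n += 1
    return pairs


for discount, ratio in discount_ratios(LISTING_PRICE_K):
    print(f"discount {discount:5.1f}K -> ratio {ratio:.4f}")
```

Under this reading, a round discount on a non-round price gives a non-round ratio (a 5K discount on 91.3K is a ratio of about 0.945), and because the ratio is 1 - discount / listing price, each fixed discount traces a smooth curve as the listing price changes. That is consistent with seeing frontiers in the scatter plot instead of clusters at fixed percentages.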