Hybrid AI Search 4 - Get tweet content fast and free

Initially, I used Jina Reader (https://jina.ai/reader/) to convert web pages into Markdown, then split the Markdown into chunks and stored them in the vector database.
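As a rough illustration of that pipeline, here is a minimal sketch. It assumes Jina Reader is called by prefixing the page URL with https://r.jina.ai/ (the form used in the example link in this post). The helper names `toReaderUrl`, `fetchMarkdown`, and `splitMarkdown` and the 1000-character chunk size are hypothetical, not taken from the actual project.

```typescript
// Assumption: Jina Reader is called by prefixing the target URL with this host.
const JINA_READER_PREFIX = 'https://r.jina.ai/';

function toReaderUrl(pageUrl: string): string {
  return JINA_READER_PREFIX + pageUrl;
}

// Hypothetical splitter: packs whole paragraphs into chunks of roughly
// maxChars characters. A single paragraph longer than maxChars is kept
// as one oversized chunk rather than being cut in the middle.
function splitMarkdown(markdown: string, maxChars = 1000): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const paragraph of markdown.split(/\n{2,}/)) {
    const text = paragraph.trim();
    if (!text) continue;
    // Start a new chunk when adding this paragraph would exceed the limit.
    if (current && current.length + text.length + 2 > maxChars) {
      chunks.push(current);
      current = '';
    }
    current = current ? `${current}\n\n${text}` : text;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Fetches the Markdown version of a page; the chunks can then be embedded
// and written to whatever vector database is in use.
async function fetchMarkdown(pageUrl: string): Promise<string> {
  const response = await fetch(toReaderUrl(pageUrl));
  if (!response.ok) {
    throw new Error(`Jina Reader request failed: ${response.status}`);
  }
  return response.text();
}
```

Splitting on blank lines keeps paragraphs intact, which is usually a reasonable starting point before tuning chunk sizes for a specific embedding model.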
Issues with Jina Reader
I found two issues when indexing Twitter content:
- Jina Reader returns a lot of irrelevant information along with the tweet; see https://r.jina.ai/https://x.com/ahaapple2023/status/1816118637696279006 for an example.
- Jina Reader has high latency: most requests take 1 or 2 seconds, but occasionally one takes up to 1 minute.
How to get tweet content fast and free
1. Use the Twitter API
The official API is too expensive.
2. Use the Exa API
import Exa from 'exa-js';

const exa = new Exa('API_TOKEN');
const result = await exa.getContents(['https://x.com/ahaapple2023/status/1816118637696279006'], {
    text: true,
});
This is faster, but it costs $1 per 1,000 requests and returns only plain text. I also need the images attached to each tweet.
3. Use the cdn.syndication.twimg.com API
Change the id parameter to the ID of the tweet you want to fetch.
This endpoint is fast and free. The react-tweet library uses it as well.
Below is my implementation, which extracts the tweet text and images. Stars are welcome.
https://github.com/memfreeme/memfree/blob/main/vector/tweet.ts
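For readers who want the general shape without opening the repository, here is a minimal sketch of the same idea. It is not the linked implementation. The `tweet-result` path, the token derivation (modeled on what react-tweet appears to do), and the response fields `text` and `mediaDetails[].media_url_https` are assumptions about an undocumented endpoint and may change without notice. All function names are hypothetical.

```typescript
// Assumed response shape: only the fields this sketch reads.
interface SyndicationTweet {
  text?: string;
  mediaDetails?: { type?: string; media_url_https?: string }[];
}

// Assumption: the endpoint expects a token derived from the tweet ID,
// computed here the way react-tweet appears to compute it.
function getToken(id: string): string {
  return ((Number(id) / 1e15) * Math.PI).toString(36).replace(/(0+|\.)/g, '');
}

// Assumption: tweet-result is the path served by cdn.syndication.twimg.com.
function buildTweetUrl(id: string): string {
  const url = new URL('https://cdn.syndication.twimg.com/tweet-result');
  url.searchParams.set('id', id);
  url.searchParams.set('token', getToken(id));
  return url.toString();
}

// Pulls the tweet text and the photo URLs out of the response object.
function extractTextAndImages(tweet: SyndicationTweet): { text: string; images: string[] } {
  const images: string[] = [];
  for (const media of tweet.mediaDetails ?? []) {
    if (media.type === 'photo' && typeof media.media_url_https === 'string') {
      images.push(media.media_url_https);
    }
  }
  return { text: tweet.text ?? '', images };
}

async function getTweet(id: string): Promise<{ text: string; images: string[] }> {
  const response = await fetch(buildTweetUrl(id));
  if (!response.ok) {
    throw new Error(`Syndication request failed: ${response.status}`);
  }
  return extractTextAndImages((await response.json()) as SyndicationTweet);
}
```

Keeping the parsing in `extractTextAndImages`, separate from the network call, makes it easy to adjust if the response format changes.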
4. Use document.title in a Chrome extension
In the end, the fastest and easiest method I found, which is also free and unlimited, is reading document.title.
In a Chrome extension, you can read document.title directly. It contains the same text that cdn.syndication.twimg.com returns.
For reference, see https://github.com/memfreeme/memfree/blob/main/extention/public/inject.js#L68
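If only the tweet text is needed, the surrounding parts of the title can be stripped. The sketch below assumes tweet pages use a title of the form `Author on X: "tweet text" / X`. That format is an observation, not a documented contract, and the helper name `extractTweetTextFromTitle` is hypothetical. When the pattern does not match, the function returns the title unchanged.

```typescript
// Assumption: tweet page titles look like: Author on X: "tweet text" / X
// If the title does not match, it is returned as-is.
function extractTweetTextFromTitle(title: string): string {
  const match = title.match(/^.*? on X: "([\s\S]*)" \/ X$/);
  return match?.[1] ?? title;
}
```

In a content script this would be called as `extractTweetTextFromTitle(document.title)`. The greedy capture group keeps any double quotes that appear inside the tweet itself.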
5. Use puppeteer and puppeteer-extra-plugin-stealth
Once we know that document.title contains the tweet text, a natural idea is to load the tweet page in a headless browser and read its title.
Note that using plain puppeteer in headless mode does not work. We need puppeteer-extra together with puppeteer-extra-plugin-stealth to make the requests look like they come from a real user.
The sample code is as follows:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Register the stealth plugin so that headless Chrome looks like a regular browser.
puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
        await page.setViewport({ width: 1280, height: 800 });
        await page.goto('https://x.com/ahaapple2023/status/1818291443049615786', {
            waitUntil: 'networkidle2',
            timeout: 60000,
        });
        // The page title carries the tweet text once the page has loaded.
        await page.waitForSelector('title');
        const title = await page.title();
        console.log(title);
    } finally {
        // Close the browser even if navigation fails or times out.
        await browser.close();
    }
})();