(My) Overview: NewsAPI
The Problem:
I am building a personal news aggregator that provides the three latest or most important articles about a specific topic.
I need a source of information, either a news API or a crawler targeting news websites.
The Solution:
Create a script that uses a news API, or a series of scripts that crawl relevant news sites, to search for important or new articles on specific topics.
The Implementation:
⚠️ I will discard (for now) the use of spiders (crawlers), since this requires some extra "admin" work (checking whether it is legal to crawl the target website).
I will use NewsAPI as the source for the news, and Google Sheets and a Telegram bot as a way to display the results returned by NewsAPI.
The filter parameters will include:
- The source.
- The day of publication.
- A maximum of three articles per topic.
❓What is NewsAPI?
News API is a REST API that returns JSON-formatted results from more than 800,000 sources, according to the official NewsAPI website.
Get the API Key
1. Go to the get started page.
2. Click Get API Key.
3. Fill out the form. API key obtained!
Be aware that the free tier has limitations. For most applications the free tier will be enough.
The documentation provides a list of client libraries in different languages. The Python client mattlisiv/newsapi-python is not an official client, but it is simple to use, so this is what I am going to use.
News API Description.
The API is sub-divided into two* endpoints.
- /v2/everything: gathers all information about a specific topic.
- /v2/top-headlines: gets the top headlines based on country and language.
- /v2/top-headlines/sources*: this is a specialized endpoint. It returns information (including name, description, and category) about the sources used to provide the headlines.
Authentication
There are three different ways to authenticate with the API:
- As part of the query string, apiKey="Here the API key".
- Via the X-Api-Key HTTP header.
- Via the Authorization HTTP header. Including Bearer is optional.
# Via query string
GET https://newsapi.org/v2/everything?q=keyword&apiKey=db0c830faab34094b9dyyyxxxxxxxx
# Via X-Api-Key HTTP header
X-Api-Key: db0c830faab34094b9dyyyyyxxxxxxxx
# Via Authorization HTTP header
Authorization: db0c830faab34094b9dyyyyyxxxxxxxx
This is a personal project, so I chose the option I feel most comfortable with: the X-Api-Key header.
If the API key is wrong or missing, a 401 Unauthorized HTTP error is returned.
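As a minimal sketch of the header-based option, the snippet below builds (but does not send) a request to /v2/everything with the key in the X-Api-Key header. It uses only the Python standard library. The helper name build_request and the placeholder key are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://newsapi.org/v2/everything"


def build_request(query: str, api_key: str) -> Request:
    """Build a GET request that authenticates via the X-Api-Key header."""
    url = f"{BASE_URL}?{urlencode({'q': query})}"
    # The key travels in a header, so it stays out of the URL.
    return Request(url, headers={"X-Api-Key": api_key})


request = build_request("keyword", "YOUR_API_KEY")  # placeholder key
print(request.full_url)
```

Passing this object to urllib.request.urlopen would send it. With a wrong or missing key, urlopen raises an HTTPError for the 401 response.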
Endpoints
/v2/everything
This endpoint is a good option for general-purpose search, discovery, and analysis, which is what I need (I will explain later why I am not using top headlines).
For more information, check the official documentation.
Request parameters
- apiKey: can be passed as part of the query string or in one of the other forms discussed previously.
- q and qInTitle: q provides the keywords or phrases to search for. qInTitle restricts the search to keywords and phrases present in the article title.
- sources: limits the sources from which the articles are obtained.
- from and to: self-explanatory; they limit the time frame of the news.
GET https://newsapi.org/v2/everything?q=apple&from=2021-10-02&to=2021-10-02&sortBy=popularity&apiKey=db0c830faab340yyyyyyxxxxxxxxxxxx
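The same request can be assembled in Python. This is a sketch under my own assumptions: the function name everything_url is hypothetical, and the API key is left out of the URL on the assumption that it is sent in the X-Api-Key header instead.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"


def everything_url(query, day_from, day_to, sort_by="popularity"):
    """Return the /v2/everything URL for a keyword and a date range."""
    params = {
        "q": query,
        "from": day_from.isoformat(),  # dates are sent as YYYY-MM-DD
        "to": day_to.isoformat(),
        "sortBy": sort_by,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = everything_url("apple", date(2021, 10, 2), date(2021, 10, 2))
print(url)
```

Using urlencode instead of string concatenation also takes care of escaping topics that contain spaces or special characters.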
Response Object
The response object will be in JSON format. Below is an illustration.
{
"status": "ok",
"totalResults": 2177,
"articles": [
{results_1},
{results_2},
]
}
From the code above:
- status: just an indicator of whether the request was successful.
- totalResults: the number of results.
- articles: an array of JSON objects that contain the news articles.
Each object in the articles array carries its own parameters, such as title, description, and url.
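To make the shape concrete, here is a small sketch that reads status, totalResults, and articles from a response. The sample payload is made up for illustration and is not real API output.

```python
import json

# Made-up sample that mirrors the shape described above; not real API output.
raw = """
{
  "status": "ok",
  "totalResults": 2,
  "articles": [
    {"title": "First headline", "url": "https://example.com/1"},
    {"title": "Second headline", "url": "https://example.com/2"}
  ]
}
"""

response = json.loads(raw)

# Only read the articles when the request succeeded.
if response["status"] == "ok":
    titles = [article["title"] for article in response["articles"]]
else:
    titles = []

print(response["totalResults"], titles)
```

In a real response, totalResults can be larger than the number of items in articles, because the results are returned in pages.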
/v2/top-headlines and /v2/top-headlines/sources
These endpoints provide breaking news or headlines for a country, and the sources of those headlines.
During the implementation, I ran into a simple issue. I am in Asia, and even with the country parameter set to CO or US (for Spanish or English), I still got some headlines in Asian languages or headlines from a few weeks ago. To keep it simple, I decided to use the everything endpoint.
More information:
/v2/top-headlines
/v2/top-headlines/sources
🔥Errors
The response will include:
- status: a simple string, error.
- code: a short code identifying the error (not the HTTP status code).
- message: a description of the error.
{
"status": "error",
"code": "apiKeyMissing",
"message": "Your API key is missing. Append this to the URL with the apiKey param, or use the x-api-key HTTP header."
}
HTTP status
- 200 OK: success.
- 400 Bad Request: the request is unacceptable, most likely a missing parameter or an error in one.
- 401 Unauthorized: your API key is not correct.
- 429 Too Many Requests: too many requests in a short time window.
- 500 Server Error: something is wrong with NewsAPI.
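One way to act on these statuses is to map each one to a coarse next step. The sketch below does that. Which statuses are worth retrying is my own assumption, not something the NewsAPI documentation prescribes, and the name classify_status is hypothetical.

```python
# Assumption: treating 429 and 500 as retryable is my own choice.
RETRYABLE = {429, 500}


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse action for the caller."""
    if status_code == 200:
        return "ok"
    if status_code in RETRYABLE:
        return "retry later"
    if status_code == 401:
        return "check API key"
    # Anything else (for example 400) points at the request itself.
    return "fix request"


for code in (200, 400, 401, 429, 500):
    print(code, classify_status(code))
```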
Error codes
For a complete list, check the documentation.
The most relevant error codes, or those I might use in a try/except block, are:
- apiKeyDisabled: the key is disabled.
- apiKeyExhausted: we reached the limit of the plan.
- parameterInvalid: the request has some invalid parameters.
- parametersMissing: the request is missing some parameters.
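A try/except block around these codes could look like the sketch below. The exception class NewsApiError and the helper check_response are hypothetical names of my own, and the sample payload is made up in the documented error shape.

```python
class NewsApiError(Exception):
    """Hypothetical exception carrying the API error code and message."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_response(payload: dict) -> dict:
    """Return the payload if it is not an error, otherwise raise NewsApiError."""
    if payload.get("status") == "error":
        raise NewsApiError(payload.get("code", "unknown"), payload.get("message", ""))
    return payload


# Made-up error payload in the documented shape.
sample = {"status": "error", "code": "apiKeyExhausted", "message": "Limit reached."}

try:
    check_response(sample)
    outcome = "ok"
except NewsApiError as err:
    if err.code in {"apiKeyDisabled", "apiKeyExhausted"}:
        outcome = "stop: key problem"
    elif err.code in {"parameterInvalid", "parametersMissing"}:
        outcome = "fix the request parameters"
    else:
        outcome = "unexpected error"

print(outcome)
```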
Client Library
As mentioned in the documentation, there is an unofficial Python library.
Installing
pip install newsapi-python
Example Code
Some recommendations:
- For get_top_headlines, it is not possible to make a request using sources together with category or country; it will return an error.
- For get_everything, pay attention to the time frame: on the free tier, the API limits the search to articles up to one month old.
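A call to get_everything with newsapi-python could look like the following sketch. The wrapper names make_client and search_topic are my own, the four-day window and page size of three are assumptions taken from this project, and the import is done lazily so the helper can be tried with any object that exposes get_everything.

```python
from datetime import date, timedelta


def make_client(api_key: str):
    """Create the client lazily so the import only runs when it is needed."""
    from newsapi import NewsApiClient  # pip install newsapi-python
    return NewsApiClient(api_key=api_key)


def search_topic(client, topic: str, days_back: int = 4, today=None):
    """Query get_everything for a topic over the last `days_back` days."""
    today = today or date.today()
    start = today - timedelta(days=days_back)
    return client.get_everything(
        q=topic,
        from_param=start.isoformat(),  # the library names this from_param
        to=today.isoformat(),
        language="en",
        sort_by="relevancy",
        page_size=3,
    )


# Usage (needs a real key and network access):
# client = make_client("YOUR_API_KEY")
# response = search_topic(client, "raspberry pi")
```

Keeping days_back small also stays well inside the one-month limit of the free tier.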
👉(My) Implementation
First, I need to decide what information I will extract from the response object.
I don't need all the information in the response object. The idea is to have a snippet of the news; if I am interested, I will use the url to navigate to the source.
- I need the content of articles.
- I don't need content. It is truncated.
- I will focus on source > name, author, title, description, publishedAt and url.
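The selection above can be sketched as a small function that keeps only those fields. The name summarize is hypothetical, and the sample article is made up in the shape of an item from articles.

```python
def summarize(article: dict) -> dict:
    """Keep only the fields needed for a short snippet of the article."""
    return {
        # source is a nested object; guard against it being missing or null
        "source": (article.get("source") or {}).get("name"),
        "author": article.get("author"),
        "title": article.get("title"),
        "description": article.get("description"),
        "publishedAt": article.get("publishedAt"),
        "url": article.get("url"),
    }


# Made-up article in the shape returned inside "articles".
sample = {
    "source": {"id": None, "name": "Example News"},
    "author": "Jane Doe",
    "title": "A sample title",
    "description": "A short description.",
    "url": "https://example.com/article",
    "publishedAt": "2021-10-02T08:00:00Z",
    "content": "Truncated body text...",
}

snippet = summarize(sample)
print(snippet)
```

Using .get throughout means an article with a missing field yields None for that field instead of raising a KeyError.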
Next steps
- I want to create a Telegram bot where I will input a single topic or a list of topics, and I will get back a message with the parameters I already mentioned, especially the description and the url of the original article.
- I will set up the Raspberry Pi to run a web server that takes care of running the script and handling the Telegram bot requests.
Final Thoughts
- I hardcoded the time frame to four days in the past. I don't think I need information older than four days, but it is still a hardcoded parameter.
- The query parameter q looks for the topic keyword in the article body, so it is possible that an article will not have relevant information.
- I filter the articles to get the three most relevant, but I don't check whether those three articles come from the same source. It might be a good idea to create an extra function to ensure each article comes from a different source.
- In some cases the topic doesn't yield any results, and I don't have any function handling this scenario.
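The last two points could be addressed along the lines of the sketch below. It is not part of the current script. The names pick_distinct and format_result are hypothetical, and it assumes the article list is already sorted by relevance.

```python
def pick_distinct(articles, limit=3):
    """Return up to `limit` articles, at most one per source name."""
    seen = set()
    picked = []
    for article in articles:  # assumes the list is already sorted by relevance
        name = (article.get("source") or {}).get("name")
        if name in seen:
            continue
        seen.add(name)
        picked.append(article)
        if len(picked) == limit:
            break
    return picked


def format_result(topic, articles):
    """Build the message text, covering the case where nothing was found."""
    chosen = pick_distinct(articles)
    if not chosen:
        return f"No articles found for '{topic}'."
    return "\n".join(f"{a['title']} - {a['url']}" for a in chosen)


print(format_result("python", []))
```

One caveat of this approach: articles without a source name all count as the same "unknown" source, so only the first of them is kept.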