Title: A Simple Web Scraper
Description: A group of functions to scrape data from different websites, for academic purposes.
Authors: Roberto Villegas-Diaz [aut, cre]
Maintainer: Roberto Villegas-Diaz <[email protected]>
License: MIT + file LICENSE
Version: 0.0.2
Built: 2024-11-12 02:39:23 UTC
Source: https://github.com/villegar/scrappy
Digimap's Ordnance Survey collection provides the full range of topographic Ordnance Survey data for the UK. Note that this function only submits the data request; once the request has been processed, you will have to download the data manually (Digimap will email you instructions).
digimap_os(
  client,
  area_name = NULL,
  dataset = NULL,
  format = NULL,
  version = NULL,
  org = Sys.getenv("ORG"),
  sleep = 1
)
client |
RSelenium client. |
area_name |
String with UK national grid name (e.g., 'SD', 'SD20'). See ordnancesurvey.co.uk/documents/resources/guide-to-nationalgrid.pdf. |
dataset |
String with the name of the data set to download (e.g., 'NTM' for National Tree Map or 'Terrain-5 DTM' for the OS Terrain 5 Digital Terrain Model). See https://digimap.edina.ac.uk/help/our-maps-and-data/os_products/. |
format |
String with the data format (e.g., 'SHAPE'). See https://digimap.edina.ac.uk/help/our-maps-and-data/os_products/. |
version |
String with the version of the data set (e.g., 'July 2023'). See https://digimap.edina.ac.uk/help/our-maps-and-data/os_products/. |
org |
String with your organisation name (for login purposes, this can be done manually). Done only once per session. |
sleep |
Integer with number of seconds to use as pause between actions on the web page. |
Logical value with the status of the data request.
https://digimap.edina.ac.uk/os
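The other functions in this reference carry a usage example; a comparable sketch for digimap_os is shown below, in the same `## Not run:` style. The dataset, format, and organisation values are illustrative assumptions, not tested inputs; check the Digimap help pages for the options available to your institution.

```r
## Not run:
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Request the National Tree Map for grid square 'SD20'
# ('NTM', 'SHAPE', and the organisation name below are illustrative values)
ok <- scrappy::digimap_os(
  client = rD$client,
  area_name = "SD20",
  dataset = "NTM",
  format = "SHAPE",
  org = "My University"
)
# Stop server
rD$server$stop()
## End(Not run)
```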
Convert duration to date-time
Convert a string with a duration (e.g., 'an hour ago') to a date-time string, based on a reference time.
duration2datetime(
  str,
  ref_time = Sys.time(),
  output_format = "%Y-%m-%d %H:%M:%S %Z"
)
str |
String with a duration (see examples). |
ref_time |
Reference time (default: Sys.time()). |
output_format |
String with the format of the output (default: "%Y-%m-%d %H:%M:%S %Z"). |
Date-time object based on the input string, str, and the reference time, ref_time.
duration2datetime("a minute ago")
duration2datetime("an hour ago")
duration2datetime("a day ago")
duration2datetime("a week ago")
duration2datetime("a month ago")
duration2datetime("a year ago")
duration2datetime("2 minutes ago")
duration2datetime("2 hours ago")
duration2datetime("2 days ago")
duration2datetime("2 weeks ago")
duration2datetime("2 months ago")
duration2datetime("2 years ago")
Scrape GP practices near a given postcode
find_a_gp(
  client,
  postcode,
  base = "https://www.nhs.uk/service-search/find-a-gp",
  sleep = 1
)
client |
RSelenium client. |
postcode |
String with the target postcode. |
base |
String with the base URL for the NHS 'Find a GP' service. |
sleep |
Integer with number of seconds to use as pause between actions on the web page. |
Data frame with GP practices near the given postcode.
## Not run:
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Retrieve GP practices near L69 3GL
# (Waterhouse building, University of Liverpool)
wh_gps_tb <- scrappy::find_a_gp(rD$client, postcode = "L69 3GL")
# Stop server
rD$server$stop()
## End(Not run)
Scrape Google Maps' reviews
google_maps(
  client,
  name,
  place_id = NULL,
  base = "https://www.google.com/maps/search/?api=1&query=",
  sleep = 1,
  max_reviews = 100,
  result_id = 1,
  with_text = FALSE
)
client |
RSelenium client. |
name |
String with the name of the target place. |
place_id |
String with the unique ID of the target place, useful when more than one place has the same name. |
base |
String with the base URL for the Google Maps website. |
sleep |
Integer with number of seconds to use as pause between actions on the web page. |
max_reviews |
Integer with the maximum number of reviews to scrape. The number of existing reviews will define the actual number of reviews returned. |
result_id |
Integer with the result position to use; only relevant when there are multiple matches for the given name. |
with_text |
Boolean value to indicate if the review text should be scraped. |
Tibble with the reviews extracted from Google Maps.
## Not run:
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Retrieve reviews for Sefton Park in Liverpool
sefton_park_reviews_tb <- scrappy::google_maps(
  client = rD$client,
  name = "Sefton Park",
  place_id = "ChIJrTCHJVkge0gRm1LWF0fSPgw",
  max_reviews = 20
)
sefton_park_reviews_tb_with_text <- scrappy::google_maps(
  client = rD$client,
  name = "Sefton Park",
  place_id = "ChIJrTCHJVkge0gRm1LWF0fSPgw",
  max_reviews = 20,
  with_text = TRUE
)
# Stop server
rD$server$stop()
## End(Not run)
Retrieve Weather data from the Network for Environment and Weather Applications (NEWA) at Cornell University.
newa_nrcc(
  client,
  year,
  month,
  station,
  base = "http://newa.nrcc.cornell.edu/newaLister",
  interval = "hly",
  sleep = 6,
  table_id = "#dtable",
  path = getwd(),
  save_file = TRUE
)
client |
RSelenium client. |
year |
Numeric value with the year. |
month |
Numeric value with the month. |
station |
String with the station abbreviation. See http://newa.cornell.edu/index.php?page=station-pages for a list of stations. |
base |
Base URL (default: http://newa.nrcc.cornell.edu/newaLister). |
interval |
String with data interval (default: hly, hourly). |
sleep |
Numeric value with the number of seconds to wait for the page to load the results (default: 6 seconds). |
table_id |
String with the unique HTML ID assigned to the table containing the data (default: "#dtable"). |
path |
String with the path to the location where CSV files should be stored (default: getwd()). |
save_file |
Boolean flag to indicate whether or not the output should be stored as a CSV file. |
Tibble with the data retrieved from the server.
## Not run:
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Retrieve data for the Geneva (Bejo) station on 2020/12
scrappy::newa_nrcc(rD$client, 2020, 12, "gbe")
# Stop server
rD$server$stop()
## End(Not run)
Retrieve Weather data from the Network for Environment and Weather Applications (NEWA) version 3.0 at Cornell University.
newa_nrcc3(
  year,
  month,
  day,
  hour,
  station,
  base = "https://hrly.nrcc.cornell.edu/stnHrly"
)
year |
Numeric value with the start year. |
month |
Numeric value with the start month. |
day |
Numeric value with the start day. |
hour |
Numeric value with the start hour. |
station |
String with the station abbreviation. See the newa3_stations dataset for a list of stations. |
base |
Base URL (default: https://hrly.nrcc.cornell.edu/stnHrly). |
List of data frames with hourly (hourly), daily (daily), hourly forecast (hourly_forecast), and daily forecast (daily_forecast) data.
scrappy::newa_nrcc3(2021, 12, 01, 00, "gbe")
A dataset containing information of 718 weather stations in the Network for Environment and Weather Applications (NEWA) at Cornell University.
data(newa_stations)
A data frame with 718 rows and 3 variables:
Station's name.
State where the station is located.
Station's code.
Network for Environment and Weather Applications [email protected]
http://newa.cornell.edu/index.php?page=station-pages
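A minimal sketch of how the bundled dataset can be inspected, assuming the scrappy package is installed. The variable names are not stated above, so str() is used rather than guessing them:

```r
# Load the bundled station table and inspect its structure
data(newa_stations, package = "scrappy")
str(newa_stations)   # 718 rows, 3 variables
head(newa_stations)
```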
A dataset containing information of 801 weather stations in the Network for Environment and Weather Applications (NEWA) version 3 at Cornell University.
data(newa3_stations)
A data frame with 801 rows and 10 variables:
Station's name.
State where the station is located.
Station's code.
Entity to which the station is affiliated.
Entity's URL.
Station's latitude.
Station's longitude.
Station's elevation.
Start year (data available).
Boolean flag to indicate if the station is part of the International Civil Aviation Organization (ICAO) (e.g., is an airport).
Network for Environment and Weather Applications [email protected]
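As with newa_stations, a minimal sketch for inspecting this dataset, assuming the scrappy package is installed; the ten variable names are not listed above, so str() avoids assuming them:

```r
# Load the version 3 station table and inspect its structure
data(newa3_stations, package = "scrappy")
str(newa3_stations)   # 801 rows, 10 variables
head(newa3_stations)
```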
Print Values
Print Google Maps' reviews
## S3 method for class 'gmaps_reviews' print(x, ...)
x |
an object used to select a method. |
... |
further arguments passed to or from other methods. |
Wait until the page has finished loading the element with the given tag value.
wait_to_load(client, using = "css", value = "body", sleep = 1)
client |
RSelenium client. |
using |
String with the property to use to find the element (e.g. "css", "xpath", etc.) (default: "css"). |
value |
String with the tag of the page element to wait to load (default: "body"). |
sleep |
Numeric value with the number of seconds to wait for the page to load the results (default: 1 second). |
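wait_to_load has no example block; below is a hedged sketch in the style of the other examples. The target URL is illustrative (it reuses the find_a_gp default), and only the documented default arguments are shown:

```r
## Not run:
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Navigate, then block until the <body> element has loaded before scraping
rD$client$navigate("https://www.nhs.uk/service-search/find-a-gp")
scrappy::wait_to_load(rD$client, using = "css", value = "body", sleep = 1)
# Stop server
rD$server$stop()
## End(Not run)
```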