



Chapter 1 GENERAL CONCEPTS
1.2.1. Why is it used?
Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow developers to package an application with all its dependencies (code, libraries, environment variables, etc.) so that it works consistently across different environments. This is especially useful when moving applications from one machine to another, like from a developer's local environment to a production server.
Docker solves the problem of inconsistencies between environments by bundling everything needed to run an application inside a container, which is lightweight, portable, and isolated from the host machine. Containers can run on any machine that has Docker installed, ensuring that our application behaves the same regardless of where it's deployed.
1.2.2. How Docker is Used in Web Scraping:
In web scraping, Docker helps package your scraping environment (Python libraries, scraping tools, browser drivers like ChromeDriver for Selenium, etc.) into a single container. This eliminates issues with software dependencies across different machines. Note that a container is a lightweight, standalone executable package that includes everything needed to run an application (e.g., source code, libraries, settings). For example, Selenium needs a browser and a matching browser driver, and BeautifulSoup needs specific Python packages. Docker makes sure everything is bundled correctly so our scraping code will run the same way wherever the container is executed.
1.2.3. Docker Example for Web Scraping
Here’s how I started building a simple web scraping tool with Docker:
mkdir webscraper
cd webscraper
scraper.py:
import requests
from bs4 import BeautifulSoup

# Make a request to a website
URL = 'https://example.com'
response = requests.get(URL)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the title of the page
title = soup.title.string
print(f"Page title: {title}")
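The script above fetches and parses in one pass, so it cannot be checked without network access. As a rough sketch only, the title-extraction step can be pulled into a function and tried on an inline HTML string. To stay free of third-party packages, this version swaps BeautifulSoup for the standard library's html.parser; the names TitleParser and extract_title are hypothetical and not part of the original script.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the stripped page title, or None if the page has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return title or None
```

Passing response.text from the original script to extract_title would give the same kind of result as soup.title.string, with the difference that a missing title yields None instead of raising an error.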
requirements.txt:
requests
beautifulsoup4
Dockerfile:
# Use an official Python runtime as a base image
FROM python:3.8-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Run the scraper script when the container starts
CMD ["python", "scraper.py"]
• FROM python:3.8-slim: This line tells Docker to use the official Python image as the base for my container. The 3.8-slim version is a lightweight version of Python 3.8, which is smaller in size and includes just enough libraries to run Python applications. Using a slim image makes the container smaller, faster to build, and more efficient.
• WORKDIR /app: This sets the working directory inside the container to /app. Every subsequent command (like copying files or installing dependencies) will happen within this directory.
• COPY . /app: This copies all files and folders from my local machine's current directory into the /app directory inside the container, allowing Docker to see my scraper.py script, requirements.txt, and any other necessary files.
• RUN pip install --no-cache-dir -r requirements.txt: This installs the Python packages listed in requirements.txt inside the container using pip. The --no-cache-dir option ensures that pip doesn't save cache files for installed packages, keeping the container small and efficient.
• CMD ["python", "scraper.py"]: This specifies the command to run when the container starts, telling Docker to run the scraper.py Python script.
docker build -t webscraper .
This command tells Docker to create an image named webscraper based on the Dockerfile in the current directory.
• -t webscraper: This flag assigns the name webscraper to the image. I can use any name here; webscraper is just an example.
• . (dot): The dot at the end sets the build context to the current directory, which is also where Docker looks for the Dockerfile by default.
docker run webscraper
This executes the scraper.py file inside the container and outputs the title of the web page. Docker creates a new container using the webscraper image and then executes the command specified in the Dockerfile.
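The build and run steps can also be scripted. The following is a minimal sketch, assuming the Docker CLI is on the PATH and the script is started from the webscraper directory; the function names build_command, run_command, and build_and_run are hypothetical helpers, not part of Docker.

```python
import subprocess


def build_command(tag, context="."):
    """Argument list for `docker build -t <tag> <context>`."""
    return ["docker", "build", "-t", tag, context]


def run_command(image):
    """Argument list for `docker run <image>`."""
    return ["docker", "run", image]


def build_and_run(tag):
    """Build the image, then start a container from it.

    check=True raises CalledProcessError if either docker command fails,
    so a broken build never reaches the run step.
    """
    subprocess.run(build_command(tag), check=True)
    subprocess.run(run_command(tag), check=True)
```

Calling build_and_run("webscraper") from the project directory would run the same two commands shown above, in the same order.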
docker container ls -a     # List all containers
docker rm <container_id>   # Remove the container by ID
If I have many containers running or exited, it’s a good idea to remove them once they’re no longer needed to free up system resources.
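Removing containers one ID at a time gets tedious, so the cleanup can be scripted as well. This is a sketch under assumptions: the Docker CLI is on the PATH, only containers in the exited state should be removed, and the helper names are hypothetical. It uses the `--filter status=exited` and `-q` options of `docker container ls` to print just the IDs of exited containers.

```python
import subprocess

# Lists only the IDs (-q) of containers whose status is "exited".
LIST_EXITED = ["docker", "container", "ls", "-a",
               "--filter", "status=exited", "-q"]


def parse_container_ids(output):
    """Turn the one-ID-per-line output into a list, skipping blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def remove_command(container_ids):
    """Argument list for `docker rm <id> <id> ...`."""
    return ["docker", "rm", *container_ids]


def remove_exited_containers():
    """Remove every exited container and return the IDs that were removed."""
    listing = subprocess.run(LIST_EXITED, capture_output=True,
                             text=True, check=True)
    ids = parse_container_ids(listing.stdout)
    if ids:  # `docker rm` with no arguments is an error, so skip it
        subprocess.run(remove_command(ids), check=True)
    return ids
```

Running containers are left alone on purpose, since `docker rm` refuses to remove them unless they are stopped first.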
1.2.4. Benefits of Docker in Scraping:
• Consistency: The same code will run identically on any machine where Docker is installed, eliminating issues caused by different environments.
• Dependency Management: All the necessary dependencies are packaged inside the Docker container.
• Portability: You can easily share the Docker image with others, and they can run the same code on their machines without any setup issues.
• Isolation: The containerized environment isolates the scraping tool from your host machine, ensuring that any issues inside the container won't affect the host.
Scraping third-party apps and websites means collecting data from websites or apps that we do not own. It can be tricky because some websites block scraping, and others place legal restrictions on scraping their data. We scrape third-party apps and websites when we need information that they display publicly, like prices, reviews, or product details, but they don't offer an API to access the data easily.
To do this, we use tools like BeautifulSoup (for HTML) or Selenium (for dynamic content) to extract data, or an API when one is available. However, it's important to always check the website's terms of service to make sure we're not violating any rules.
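Terms of service have to be read by a person, but many sites also publish a robots.txt file that a scraper can check automatically before fetching a page. The sketch below uses the standard library's urllib.robotparser for that. It is an assumption that the target site publishes such a file, and robots.txt complements the terms of service without replacing them; the sample rules, the "my-scraper" user agent, and the is_allowed helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded
# from https://<site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/products",
                "https://example.com/private/data"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", url))
```

Keeping the check in a function that takes the rules as text makes it easy to try out without any network access.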