A perspective camera is just an idealized camera model used for our understanding of computer vision. For this, we will be using the frontal projection model, which is a bit different from what you may have learned before: the projection plane sits on the same side of the pinhole as the object, so the projected image is not inverted.

TLDR
Given we have points $P_w = \begin{bmatrix} X_w & Y_w & Z_w \end{bmatrix}^T$ in world frame and the transform between the camera and the world frame is given by
$$T_{cw} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
The pinhole camera model is defined as
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
Which expanded looks like
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
- $K$ is our camera intrinsic matrix
- $f_x$ is our horizontal focal length accounting for pixel size
- $f_y$ is our vertical focal length accounting for pixel size
- $s$ is our skew coefficient (usually 0)
- $c_x$ is the horizontal offset to center the image plane with the camera frame
- $c_y$ is the vertical offset to center the image plane with the camera frame
- $\begin{bmatrix} R & t \end{bmatrix}$ is our camera extrinsic matrix (chopped transformation matrix)
- $R$ is a valid rotation matrix ($R \in SO(3)$)
- $t$ is our translation vector
- $Z_c$ is the depth of the point in camera frame (must be divided out at the end to get our $u$ and $v$). The act of dividing out the depth is known as perspective projection. A numerical walkthrough of this pipeline is sketched below.
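As a sanity check, here is a minimal numpy sketch of the pipeline above. All of the values (intrinsics, pose, and point) are made up for illustration:

```python
import numpy as np

# Made-up intrinsics: 500 px focal lengths, 640x480 image, no skew.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Made-up extrinsics: identity rotation, 2 m translation along Z.
R = np.eye(3)
t = np.array([[0.0], [0.0], [2.0]])

# A point in the world frame (homogeneous).
P_w = np.array([[0.5], [0.25], [3.0], [1.0]])

# Extrinsics: world frame -> camera frame.
P_c = np.hstack([R, t]) @ P_w      # [X_c, Y_c, Z_c]^T

# Perspective projection: divide out the depth Z_c ...
p = P_c / P_c[2]                   # normalized image coordinates

# ... then apply the intrinsics to land in pixel coordinates.
uv = (K @ p)[:2].ravel()
print(uv)                          # pixel coordinates (u, v)
```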
Accounting for Distortion
Distortion is a non-linear function, and there are a number of different models for it.
Brown-Conrady Distortion Model
AKA Plumb Bob Model
$$x_{\text{dist}} = x\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2p_1 x y + p_2\left(r^2 + 2x^2\right)$$
$$y_{\text{dist}} = y\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1\left(r^2 + 2y^2\right) + 2p_2 x y$$
where $r^2 = x^2 + y^2$, $(x, y)$ are normalized image coordinates, the $k_i$ are radial distortion coefficients, and the $p_i$ are tangential distortion coefficients.
This is the distortion model used by OpenCV. The full model supports additional terms, but most of the time OpenCV uses the standard simplified 5-parameter set $(k_1, k_2, p_1, p_2, k_3)$ shown here for calibrations.
Adding Distortion into the model
We add in the distortion after the extrinsic and perspective transform, before passing through the camera intrinsics:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} x_{\text{dist}} \\ y_{\text{dist}} \\ 1 \end{bmatrix}, \qquad (x_{\text{dist}}, y_{\text{dist}}) = \operatorname{dist}(x, y)$$
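Here is a rough sketch of where the distortion step slots into the pipeline, with made-up intrinsics, pose, and coefficients. This mirrors what `cv2.projectPoints` does when handed distortion coefficients:

```python
import numpy as np

# Made-up intrinsics, pose, and distortion coefficients (k1, k2, p1, p2, k3).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
k1, k2, p1, p2, k3 = -0.2, 0.05, 0.001, 0.001, 0.0

def distort(x, y):
    # Brown-Conrady model applied to normalized image coordinates.
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_d, y_d

P_w = np.array([0.5, 0.25, 3.0])
X, Y, Z = R @ P_w + t                     # extrinsics: world -> camera frame
x, y = X / Z, Y / Z                       # perspective projection
x_d, y_d = distort(x, y)                  # distortion on normalized coords
u, v, _ = K @ np.array([x_d, y_d, 1.0])   # intrinsics come last
print(u, v)
```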
Explanation
We have our pinhole and a point in 3D space, $P$. The vector to the point from the pinhole is
$$\vec{P} = \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$$
Important to note that the $Z_c$ axis is normal to the image plane.
The projected point, which will end up on the plane $Z_c = 1$, is given by the vector
$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \frac{1}{Z_c} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$$
It is kept homogeneous to work with matrix calculations. This is called a normalized image coordinate.
Essential Matrix
If the same point is collected by the same camera after a transformation $(R, t)$, the two observations $x_1$ and $x_2$ (in normalized image coordinates) of the same point are related by
$$x_2^T E x_1 = 0$$
Where $E$ is called the essential matrix.
And it's related to the pose change of the camera:
$$E = [t]_\times R$$
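A quick numerical check of the constraint, using a made-up pose and the convention $P_2 = R P_1 + t$:

```python
import numpy as np

def skew(v):
    # Skew-symmetric matrix so that skew(v) @ u == np.cross(v, u).
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

# Made-up relative pose: 10 degree yaw plus a sideways translation.
theta = np.deg2rad(10)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.5, 0.0, 0.0])

E = skew(t) @ R

# Observe one 3D point from both poses in normalized image coordinates.
P1 = np.array([0.3, -0.2, 4.0])
P2 = R @ P1 + t
x1, x2 = P1 / P1[2], P2 / P2[2]

print(x2 @ E @ x1)   # ~0: the epipolar constraint holds
```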
Lens Distortion
It exists, and we need to characterize it and deal with it. It affects how close a real camera is to our idealized model, but once it is characterized, we can run an undistortion procedure to get the camera image to something we can actually use.
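In OpenCV the undistortion procedure is a one-liner once $K$ and the distortion coefficients are known; the filename and coefficient values below are placeholders:

```python
import cv2
import numpy as np

# K and dist would come from a prior calibration; these are placeholders.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.2, 0.05, 0.001, 0.001, 0.0])   # k1, k2, p1, p2, k3

img = cv2.imread("frame.png")
undistorted = cv2.undistort(img, K, dist)
cv2.imwrite("frame_undistorted.png", undistorted)
```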
Intrinsic Parameters
We were assuming right before this that the focal length is 1, so we actually need to deal with that (as well as map to pixel coordinates, which start from the top left of the image). The intrinsic matrix $K$ handles both, taking normalized image coordinates $(x, y)$ to pixels:
$$u = f_x x + s y + c_x, \qquad v = f_y y + c_y$$
These have to be determined through camera calibration
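A rough sketch of how that calibration usually goes with OpenCV and a chessboard target; the board dimensions and image paths are placeholders:

```python
import glob
import cv2
import numpy as np

cols, rows = 9, 6                     # inner corners of the chessboard
objp = np.zeros((rows * cols, 3), np.float32)
objp[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2)   # board-frame grid

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, (cols, rows))
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Solves for K and the distortion coefficients (plus per-view extrinsics).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(K, dist)
```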
Fundamental Matrix
Similar to the Essential Matrix, but defined between two different cameras. Let's say we have two different cameras with intrinsic matrices $K_1$ and $K_2$, observing the same point at pixel coordinates $p_1 = K_1 x_1$ and $p_2 = K_2 x_2$.
The fundamental matrix $F$ exists such that
$$p_2^T F p_1 = 0$$
Reasoning
Substituting $x_i = K_i^{-1} p_i$ into the essential matrix constraint $x_2^T E x_1 = 0$ gives
$$p_2^T K_2^{-T} E K_1^{-1} p_1 = 0 \implies F = K_2^{-T} E K_1^{-1}$$
The constraint associated with the fundamental matrix is also called the epipolar constraint
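Extending the essential matrix check above into pixel coordinates, with two made-up intrinsic matrices:

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

# Made-up pose and point, observed from both cameras.
R = np.eye(3)
t = np.array([0.5, 0.0, 0.0])
P1 = np.array([0.3, -0.2, 4.0])
P2 = R @ P1 + t
x1, x2 = P1 / P1[2], P2 / P2[2]

# Two made-up intrinsic matrices for two different cameras.
K1 = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K2 = np.array([[600.0, 0.0, 400.0], [0.0, 600.0, 300.0], [0.0, 0.0, 1.0]])

E = skew(t) @ R
F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)   # F = K2^-T E K1^-1

p1, p2 = K1 @ x1, K2 @ x2    # the same point in each camera's pixels
print(p2 @ F @ p1)           # ~0: epipolar constraint in pixel space
```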

Homography
If an observed point is on a plane of known geometry, it is possible to work out what the point will look like from another camera at a known pose change. This is called homography.


Like before, we have two cameras with intrinsics $K_1$ and $K_2$, related by a pose change $(R, t)$ so that $P_2 = R P_1 + t$.
Say we know the equation of the plane expressed in both camera frames to be
$$n_i^T P_i = d_i, \quad i \in \{1, 2\}$$
This implies that
$$\frac{n_i^T P_i}{d_i} = 1$$
Substituting this into our equation for the pose change,
$$P_2 = R P_1 + t \cdot \frac{n_1^T P_1}{d_1}$$
This implies that we can write the coordinates of $P$ in one camera frame with respect to the other as
$$P_2 = \left(R + \frac{t\,n_1^T}{d_1}\right) P_1$$
Extending this to our coordinates in the image plane, we get that
$$p_2 \sim K_2 \left(R + \frac{t\,n_1^T}{d_1}\right) K_1^{-1} p_1 = H\,p_1$$
The Homography Matrix $H$ lets us determine how a point in one image plane will appear in another image plane, given that we know the geometry of the plane the point lies on
The Homography matrix is invertible
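A small numpy sketch to confirm the claim: build $H$ from a made-up plane, pose, and pair of intrinsic matrices, then check that it maps the pixel coordinates of a point on the plane from one camera to the other:

```python
import numpy as np

# Made-up plane n^T P = d in camera-1 frame, pose change, and intrinsics.
n = np.array([0.0, 0.0, 1.0])    # plane normal in camera-1 frame
d = 4.0                          # the plane Z_c = 4 in camera 1
R = np.eye(3)
t = np.array([0.3, 0.0, 0.0])
K1 = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K2 = np.array([[600.0, 0.0, 400.0], [0.0, 600.0, 300.0], [0.0, 0.0, 1.0]])

H = K2 @ (R + np.outer(t, n) / d) @ np.linalg.inv(K1)

# Take a point on the plane, project it into both cameras, and compare.
P1 = np.array([0.5, -0.3, 4.0])          # satisfies n @ P1 == d
P2 = R @ P1 + t
p1 = K1 @ (P1 / P1[2])
p2 = K2 @ (P2 / P2[2])

p2_from_H = H @ p1
print(p2_from_H / p2_from_H[2], p2)      # match after normalizing
```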
